## Abstract

High-dimensional biomarker data are often collected in epidemiological studies when assessing the association between biomarkers and human disease is of interest. We develop a latent class modeling approach for joint analysis of high-dimensional semicontinuous biomarker data and a binary disease outcome. To model the relationship between complex biomarker expression patterns and disease risk, we use latent risk classes to link the 2 modeling components. We characterize complex biomarker-specific differences through biomarker-specific random effects, so that different biomarkers can have different baseline (low-risk) values as well as different between-class differences. The proposed approach also accommodates data features that are common in environmental toxicology and other biomarker exposure data, including a large number of biomarkers, numerous zero values, and complex mean–variance relationship in the biomarkers levels. A Monte Carlo EM (MCEM) algorithm is proposed for parameter estimation. Both the MCEM algorithm and model selection procedures are shown to work well in simulations and applications. In applying the proposed approach to an epidemiological study that examined the relationship between environmental polychlorinated biphenyl (PCB) exposure and the risk of endometriosis, we identified a highly significant overall effect of PCB concentrations on the risk of endometriosis.

## INTRODUCTION

In many epidemiological studies, it is of interest to study the association between biological markers and a binary disease outcome. When only a few biomarkers are considered, traditional statistical modeling strategies can be directly used. Challenges arise, however, when a large number of biomarkers are involved. This paper proposes joint latent class models for high-dimensional biomarker data where numerous biomarkers and disease prevalence are linked through underlying latent classes of disease risk.

As an example, we consider an epidemiological study that assessed the relationship between exposures to 62 individual polychlorinated biphenyl (PCB) congeners (chemicals with a similar structure) and endometriosis, an estrogen-dependent gynecological disease (Buck Louis *and others*, 2005). These PCBs can be divided into 3 groups: estrogenic PCBs, antiestrogenic PCBs, and other PCBs. The study recruited a cohort of 79 eligible participants (between 18 and 40 years of age) with serum measurements who underwent laparoscopy. Participants were classified as having endometriosis or not based on laparoscopic visualization. For every participant, the concentration of each of the 62 PCB congeners was quantified by dual-column gas chromatography with electron capture from her serum specimens. The primary goal of the study was to investigate the relationship between exposure to PCB congeners and the risk of endometriosis. The top and bottom panels in Figure 1 show boxplots for each of 62 PCBs for subjects with and without endometriosis, respectively. Differences between the 2 profiles appear subtle and may be complex. For example, out of 62 PCB markers, approximately 82% have higher median exposure levels for subjects with endometriosis as compared with the subjects without endometriosis. Therefore, it is desirable to develop a flexible approach, which can identify potentially complex associations between high-dimensional PCB exposures and endometriosis diagnosis.

Several challenges arise in analyzing these data. (1) “High dimensionality of biomarker data.” With 62 PCB congeners and a relatively small number of participants (*I* = 79), it may be difficult to assess the complex relationship between PCB concentrations and the risk of endometriosis with traditional statistical techniques such as logistic regression. (2) “Collective effect of numerous biomarkers.” PCB congeners may act on the risk of endometriosis in complex ways. For example, small changes in all or many PCBs may act simultaneously in increasing the risk of endometriosis. Alternatively, larger changes in a few PCBs may influence risk. It is desired to develop methodology that allows for these flexible association structures in risk modeling. (3) “Zero exposure level.” For many of the PCBs, there is a sizable proportion of zero values. These values can result from either true zero concentrations of PCB congeners or PCB values that are too small to detect in the laboratory. Moreover, the occurrence of zero PCB values is potentially related to the means of the positive values. Figure 2A shows the rate of observing nonzero PCB concentrations versus their means. It suggests that the probability of observing nonzero exposures increases with the nonzero PCB means in a linear fashion. (4) “Variance depending on mean.” Figure 2B shows the standard deviations (SDs) of the observed nonzero PCB concentration levels versus their means. It suggests that the PCB SDs are related in a linear fashion to their corresponding means. These features of the data need to be appropriately accounted for in the analysis.

In this paper, we propose joint latent class models with random effects to account for the aforementioned data analysis challenges. In particular, we consider a class of joint latent class models for associating the pattern of high-dimensional exposure data with the risk of disease through the incorporation of latent classes. Specifically, patterns in PCB exposures and risk of endometriosis are associated through assumed latent classes, which are shared between the 2 modeling components. The latent class structure allows for the flexible identification of important patterns of exposure, which can place women at high risk for endometriosis. The formulation of the joint latent class models for semicontinuous biomarkers (i.e. continuous value with numerous zeros) link the patterns in the biomarker and the risk of disease through unobserved latent classes. The pattern in the mean biomarker values varies by class and is allowed to change across biomarkers by the incorporation of random effects (Laird and Ware, 1982). Our work has close connection to the latent class models discussed in Bartholomew (1987) and Bandeen-Roche *and others* (1997). These models, which are recognized as categorical variable analogs to factor analysis, postulate the joint distribution of multiple observed discrete variables as a mixture of a few distributions conditional on the value of a latent categorical variable. However, our model is for the disease biomarker data where the disease indicator is binary and the biomarker levels are all semicontinuous. Another related approach is the finite Gaussian mixture model discussed in Fraley and Raftery (2002) and McLachlan and Peel (2002). The finite Gaussian mixture model assumes that continuous data follow a finite normal mixture model with each component corresponding to a latent cluster or latent class. This model can be viewed as a latent variable model since the unobserved categorical variable characterizes classification of observations into clusters. But unlike the proposed latent class joint model, the finite Gaussian mixture model is a model-based clustering technique for biomarkers where clustering is not informed by the disease outcome. Our work is also related to Lin *and others* (2002) who used latent classes to link longitudinal biomarker data and event times.

The remainder of the paper is organized as follows. Section 2 presents the joint latent class models for high-dimensional semicontinuous biomarker data and the binary disease outcome. Section 3 derives the log likelihood of the proposed models and discusses strategies for estimation. In Section 4, we apply the proposed model to data from a study examining the effects of environmental PCB concentrations on the risk of endometriosis. In Section 5, simulation studies are performed to demonstrate the performance of the estimation and model selection methods. Finally, we conclude in Section 6.

## JOINT LATENT CLASS MODELS

### Joint latent class models with 2 latent classes

Let *Y*_{i} be a binary variable representing disease status for participant *i*, *i* = 1,2,…,*I*, where *Y*_{i} = 1 indicates the *i*th participant has disease and *Y*_{i} = 0 indicates no disease. Denote *X*_{ij} as the level of the *j*th biomarker measured from the *i*th participant, *j* = 1,2,…,*J*. In the motivating example of PCB concentrations and endometriosis, *Y*_{i} represents endometriosis status of the *i*th participant and *X*_{ij} represents the concentration level of the *j*th PCB congener measured from the *i*th participant. Let *L*_{i} denote the latent disease risk class for the *i*th participant. We first consider the case with only 2 latent classes: low-risk class, denoted by *L*_{i} = 0, and high-risk class, denoted by *L*_{i} = 1. In Section 2.2, we extend the model to the case of more than 2 classes using an ordinal latent variable. Let *π*_{1} = *P*(*L*_{i} = 1) be the probability of a participant belonging to the high-risk class and let *π*_{0} = 1 − *π*_{1}. The probability of disease given the latent disease class is modeled as:

*β*

_{0}and

*β*

_{1}are regression coefficients characterizing the relationship between the latent variable and the disease risk. In (2.1), the coefficient

*β*

_{1}is the positive increase of the log odds of disease prevalence from low-risk to high-risk class, and the positivity constraint ensures model identifiability related to the “label switching” problem common in latent class modeling.

Each biomarker level *X*_{ij}, conditional on the latent class, is treated as a semicontinuous measurement (Olsen and Schafer, 2001) and can be denoted by 2 variables:

*U*

_{ij}is the binary nonzero biomarker value indicator and

*V*

_{ij}is the value of the biomarker when it is nonzero. We model

*V*

_{ij}using the normal distribution:

*μ*

_{ij}(

*L*

_{i},

**b**

_{j}) =

*E*(

*V*

_{ij}|

*L*

_{i},

**b**

_{j}) is the conditional mean of

*V*

_{ij}that depends on latent class indicator

*L*

_{i}and random effects vector

*b*

_{j}, and

*g*

^{2}is a variance function that depends on

**ξ**, a vector of variance parameters. In (2.7), we allow the variance of nonzero biomarker values to depend on the mean biomarker value through the variance function

*g*

^{2}(

*μ*

_{ij}(

*L*

_{i},

*b*

_{j}),

*ξ*). In the application in Section 4, the SD of PCB concentration is proportional to PCB mean (see Figure 2B). Thus, in this case, it is reasonable to let

*g*

^{2}(

*μ*

_{ij}(

*L*

_{i},

*b*

_{j}),

*ξ*) =

*ξ*

^{2}

*μ*

_{ij}

^{2}(

*L*

_{i},

*b*

_{j}).

For the mean structure of each biomarker, we assume

*α*

_{0}and

*α*

_{1}are parameters defining the mean levels of the biomarkers in the low- and high-risk groups, and is a random effects vector that follows a multivariate normal distribution. The inclusion of the latent classes

*L*

_{i}allows the multiple biomarkers from the same participant to be correlated, while the random effects

*b*

_{j0}and

*b*

_{j1}allow for individual biomarkers to deviate from the overall mean. If the 2 random effects degenerate, or equivalently

*σ*

_{0}

^{2}=

*σ*

_{1}

^{2}=

*ρ*= 0, then

*μ*

_{ij}(

*L*

_{i},

*b*

_{j}) =

*α*

_{0}+

*α*

_{1}

*L*

_{i}, which means the average concentration of each biomarker is

*α*

_{0}for participants in the low-risk class and

*α*

_{0}+

*α*

_{1}for participants in the high-risk class (see Figure 3A). If

*σ*

_{0}

^{2}≠0 and

*σ*

_{1}

^{2}=

*ρ*= 0, then

*μ*

_{ij}(

*L*

_{i},

*b*

_{j}) =

*α*

_{0}+

*α*

_{1}

*L*

_{i}+

*b*

_{j0}. In this case, there is a biomarker-specific shift

*b*

_{j0}in the average concentration of each biomarker, but the difference in biomarker between classes is the same across biomarker (see Figure 3B). With 2 nondegenerated random effects

*b*

_{j0}and

*b*

_{j1}, or equivalently

*σ*

_{0}

^{2}≠0 and

*σ*

_{1}

^{2}≠0, we allow both biomarker-specific baseline shift

*b*

_{j0}and biomarker-specific difference

*α*

_{1}+

*b*

_{j1}between the 2 classes to vary across the

*J*biomarkers (see Figure 3C). When

*α*

_{1}is large and

*σ*

_{1}

^{2}is small, the modeling approach corresponds to the situation where cumulative PCB effects transmit risk. In this case, a simple logistic regression model where the independent variable is the sum of PCB exposures describes the association. However, when

*α*

_{1}is small and

*σ*

_{1}

^{2}is large, the association is complex and a simple cumulative logistic model will miss this association. We illustrate this with a simulation study in the supplementary material available at

*Biostatistics*online.

Finally, the model for the positive biomarker indicator *U*_{ij} is

*h*(

*μ*

_{ij}(

*L*

_{i},

**b**

_{j})) can be specified according to data features. The inclusion of

*μ*

_{ij}(

*L*

_{i},

**b**

_{j}) in function

*h*allows for a flexible yet parsimonious way to model the probability of a positive biomarker concentration as a function of the mean of the nonzero biomarker values. The parameter

*η*

_{1}and function

*h*in (2.9) characterize the relationship between the occurrence of a positive measurement and the mean of the measurement given that it is positive. In our motivating example, there is a positive linear relationship between the occurrence of a positive value and the mean of these values (see Figure 2A). This motivates the specification

*h*(

*μ*

_{ij}(

*L*

_{i},

**b**

_{j})) =

*μ*

_{ij}(

*L*

_{i},

**b**

_{j}).

Equations (2.1–2.9) characterize joint latent class models with random effects for high-dimensional biomarker data and disease prevalence, where the disease prevalence and biomarker levels are linked and jointly modeled through binary latent classes among participants. In the next section, we extend the model to the situation in which the latent risk classes are ordinal.

### Joint latent class models with multiple latent classes

When more than 2 latent classes are considered, a variety of parameterizations could be pursued to extend the proposed 2-class model. A natural specification would be to treat each latent class as a categorical variable and to treat the latent classes as dummy variables in the disease prevalence model (2.1), the mean structure model (2.8), and the model for a positive biomarker value (2.9). This would require introducing fixed and random effects of dimension equal to the number of latent classes in each of the model components. Although rich and flexible, this modeling strategy is complicated and computationally demanding. Further, this approach does not exploit the inherent ordinal nature of disease severity. As a result, we build a parsimonious extension by treating *L*_{i} = 0,1,…,*K* − 1 as an ordinal variable in (2.1–2.5). In essence, we model both the logit of disease prevalence and biomarker levels with linear functions of the ordinal latent class variable *L*_{i}. For example, if 3 latent classes are considered, the model then yields probabilities *P*(*Y*_{i} = 1|*L*_{i} = 0) = exp(*β*_{0})/[1 + exp(*β*_{0})], *P*(*Y*_{i} = 1|*L*_{i} = 1) = exp(*β*_{0} + *β*_{1})/[1 + exp(*β*_{0} + *β*_{1})], and *P*(*Y*_{i} = 1|*L*_{i} = 2) = exp(*β*_{0} + 2*β*_{1})/[1 + exp(*β*_{0} + 2*β*_{1})]. These correspond to the disease prevalence of low-, middle-, and high-risk classes, respectively. Similarly, the means of biomarker levels for the 3 latent classes are *α*_{0}, *α*_{0} + *α*_{1}, and *α*_{0} + 2*α*_{1}, respectively. We focus on this parameterization to achieve a parsimonious modeling framework for the case of multiple disease classes.

### Covariates for joint latent class models

Participant-specific or biomarker-specific covariates can be incorporated into the disease probability (2.1) and biomarker distribution (2.7) as follows: *β*_{1} > 0; and *V*_{ij}|*L*_{i},**b**_{j},**Z**_{ij}),*ξ*)),

**W**

_{i}is a vector of participant-specific covariates that may affect disease prevalence and

**Z**

_{ij}is a vector of variance covariates whose components are either participant specific or biomarker specific. In our example described in Section 1, there are 3 groups of biomarkers among the 62 PCB congeners: estrogenic PCBs, antiestrogenic PCBs, and other PCBs. Thus, in Section 4, we incorporate the PCB grouping information into (2.11) with

**Z**

_{ij}comprising 2 indicator variables.

Covariates can also be introduced into (2.5):

**T**

_{ij}and parameter vector

*ζ*.

## MAXIMUM LIKELIHOOD ESTIMATION

### The likelihood

The proposed joint latent class model (with extended covariates in Section 2.3) contains unknown parameters: *β*_{0}, *β*_{1}, *γ* in (2.10), *ξ*, *α*_{0}, *α*_{1}, *σ*_{0}, *σ*_{1}, *ρ* in (2.11) and (2.8), and *η*_{0}, *η*_{1}, *ζ* in (2.12). If we define the probability that a participant belongs to the *k*th class as *π*_{k} = *P*(*L*_{i} = *k*), *k* = 0,1,…,*K* − 1, then the components of *π* = (*π*_{0},*π*_{1},…,*π*_{K − 1})^{′} are also unknown parameters. Let *y*_{i}, *u*_{ij}, *v*_{ij}, **w**_{i}, **z**_{ij}, and *t*_{ij} be the observed values of *Y*_{i}, *U*_{ij}, *V*_{ij}, **W**_{i}, **Z**_{ij}, and **T**_{ij}, respectively, and let *θ* = (*β*_{0},*β*_{1},*γ*^{′},*ξ*^{′},*α*_{0},*α*_{1},*σ*_{0},*σ*_{1},*ρ*,*η*_{0},*η*_{1},*π*^{′},*ζ*^{′})^{′}. Note that the latent variable *L*_{i} is crossed with the random effects **b**_{j} in the proposed model. Therefore, the likelihood of the proposed model is given by

*f*

_{Vij|Li,bj}is the probability density function of a univariate normal distribution with mean

*μ*

_{ij}(

*L*

_{i},

**b**

_{j}) and variance

*σ*

^{2}

*g*

^{2}(

*μ*

_{ij}(

*L*

_{i},

**b**

_{j},

**z**

_{ij}),

*ξ*) as in (2.7),

*μ*

_{ij}(

*L*

_{i},

**b**

_{j}) =

*α*

_{0}+

*α*

_{1}

*L*

_{i}+

*b*

_{j0}+

*b*

_{j1}

*L*

_{i}as in (2.8), and

*f*

_{b}is the bivariate normal distribution probability density function of

**b**

_{j}as described in Section 2.

### Monte Carlo EM algorithm

We consider an expectation-maximization (EM) algorithm (Dempster *and others*, 1977) as follows. We treat the latent class variable **L** = (*L*_{1},…,*L*_{I})^{′} and the random effects **b** = (**b**_{1},…,**b**_{J})^{′} as missing data in the EM algorithm. We denote **X**_{obs} as the observed data, **Z**_{mis} = (**L**,**b**)^{′} as the missing data, and let **Y**_{com} = (**X**_{obs},**Z**_{mis})^{′}. At the (*t* + 1)th iteration of the EM algorithm, the E-step involves calculation of the *Q*-function *Q*(*θ*|*θ*^{(t)}) = *E*[log*f*(**Y**_{com}|*θ*)|**x**_{obs},*θ*^{(t)}] = ∫log*f*(**y**_{com}|*θ*)*f*(**z**_{mis}|**x**_{obs},*θ*^{(t)})* d***z**_{mis}, where *θ*^{(t)} denotes the parameter value from the *t*th iteration, *f*(**z**_{mis}|**x**_{obs},*θ*^{(t)}) is the conditional distribution of (**L**,**b**) given the observed data and *θ*^{(t)}, and *f*(**y**_{com}|*θ*) is the full likelihood. The M-step involves maximizing *Q*(*θ*|*θ*^{(t)}) with respect to *θ* to yield the new update *θ*^{(t + 1)}. The process is iterated from a starting value *θ*^{(0)} to convergence. Under regularity conditions, the value at convergence maximizes (3.1).

The dimensionality of integration and summations in likelihood function (3.1) increases with the number of study participants and biomarkers. As a result, the integration in the E-step is intractable with numerical integration techniques. Therefore, we conduct estimation using Monte Carlo EM (MCEM) algorithm (Wei and Tanner, 1990). In this algorithm, we form the Monte Carlo approximation to the required expectation in *Q*(*θ*|*θ*^{(t)}). Booth and Hobert (1999) proposed using a rejection sampling or an importance sampling scheme. Their method produces independent and identically distributed samples that may be used to assess Monte Carlo error at each iteration and hence suggest a rule for changing the sample size to enhance speed. However, their method may break down when the intractable integrals in the likelihood function are of high dimension. An alternative approach is to use the Metropolis–Hastings algorithm, as in McCulloch (1997), to sample from the conditional distribution using the density of latent variables as the proposal distribution. Here, we use the Metropolis–Hastings algorithm but with an adaptive proposal distribution. To generate *N* values (**l**^{(n)},**b**^{(n)}), *n* = 1,2,…,*N*, from the conditional distribution *f*(**l**,**b**|**x**_{obs},*θ*^{(t)}) using the Metropolis–Hastings algorithm, we start from the starting values (**l**^{(0)},**b**^{(0)}) with *f*(**l**^{(0)},**b**^{(0)}) > 0, and then, at the *n*th Metropolis–Hastings step, sample a candidate (**l**^{*},**b**^{*}) from $f~(t)(l,b)$. Let (**l**^{(n + 1)},**b**^{(n + 1)}) = (**l**^{*},**b**^{*}) with probability min(1,*R*), where $R=f(l*,b*|xobs,\theta (t))f~(t)(l(n),b(n))/f(l(n),b(n)|x o b s,\theta (t))f~(t)(l*,b*)$, and (**l**^{(n + 1)},**b**^{(n + 1)}) = (**l**^{(n)},**b**^{(n)}) otherwise. The proposal distribution at the *t*th iteration of the EM algorithm is constructed to be , where is the empirical distribution of *l*_{i}'s sampled from the (*t* − 1)th MCEM iteration and is the product of 2 Student's *t*-distribution centered, respectively, at the means of *b*_{j0}'s and *b*_{j1}'s sampled from the (*t* − 1)th MCEM iteration (Gelman *and others*, 1995). The proposal distribution presented here dynamically incorporates the obtained information of conditional distribution *f*(**l**,**b**|**x**_{obs},*θ*^{(t)}) from the previous MCEM step and thus greatly improves the performance of the Metropolis–Hastings Markov chain. The starting values (**l**^{(0)},**b**^{(0)}) of the Metropolis–Hastings algorithm are chosen to be the the M-step estimates of the previous step of the MCEM algorithm. We run 100 steps in the EM algorithm and 10 000 Monte Carlo iterations in each MCEM step. To incorporate the constraint *β*_{1} > 0 in the MCEM algorithm, we let *β*_{1} = *e*^{β1*} with − *∞* < *β*_{1}^{*} < + *∞* and implement the MCEM algorithm with respect to the new parameter *β*_{1}^{*}.

### Empirical Bayes estimates of the random effects

Our estimation procedure consists of 2 steps: frequentist maximum likelihood estimation described in Section 3.2 for the unknown fixed parameters and empirical Bayes estimation for the random effects **b**_{j} in this section. The maximum likelihood estimation procedure is not Bayesian since no priors are assigned to the fixed effects. In obtaining the empirical Bayes estimates of the random effects, the random effects distributions are treated as priors. This is commonly done in the literature of mixed effects models; see Laird and Ware (1982) for reference.

Viewing the bivariate normal distribution of **b**_{j} as a prior distribution, approximate inferences for **b**_{j} can be obtained by calculating a posterior mean *E*(**b**_{j}|*θ*,**x**_{obs}) and covariance matrix var(**b**_{j}|*θ*,**x**_{obs}) with the unknown parameter vector *θ* replaced by its maximum likelihood estimates $Tnqbold\theta ^$. Consider the posterior expectation of some function *ℳ* of **b**_{j}, which are of the form *E*(*ℳ*(**b**_{j})|*θ*,*x*_{obs}) = ∫*ℳ*(**b**_{j})*p*(**b**_{j}|*θ*,**x**_{obs})* d***b**_{j}. The approximate inference for **b**_{j} can be carried out by calculating *E*(*ℳ*(**b**_{j})|*θ*,**x*** o** b** s*) with some form of *ℳ*. For instance, *ℳ*(**b**_{j}) = **b**_{j} results in the posterior mean *E*(**b**_{j}|*θ*,**x**_{obs}). The calculation is simply a by-product of the MCEM algorithm.

## AN ANALYSIS OF THE ENDOMETRIOSIS AND PCB CONCENTRATION DATA

We fit the joint latent class models to data from the epidemiological study that examines the effect of environmental PCB concentrations on the risk of endometriosis. Details of the study have been previously reported in Buck Louis *and others* (2005) and have been summarized in Section 1. The primary interests of the data analysis are in examining the association between complex patterns of PCB concentrations and the risk of endometriosis. Using the methodology developed in the previous sections, we fit a series of models with different numbers of latent classes. As discussed in Section 2, there is evidence to suggest *h*(*μ*_{ij}(*L*_{i},**b**_{j},**Z**_{ij}),**T**_{ij},*ζ*) = *μ*_{ij}(*L*_{i},**b**_{j},**Z**_{ij}) (Figure 2A) and *g*^{2}(*μ*_{ij}(*L*_{i},**b**_{j},**Z**_{ij}),*ξ*) = *ξ*^{2}*μ*_{ij}^{2}(*L*_{i},**b**_{j},**Z**_{ij}) (Figure 2B) when fitting the models. Two dummy variables *I*_{j,e} and *I*_{j,a} are created to represent estrogenic PCBs (*I*_{j,e} = 1 if a PCB is estrogenic; *I*_{j,e} = 0 otherwise) and antiestrogenic PCBs (*I*_{j,a} = 1 if a PCB is antiestrogenic; *I*_{j,a} = 0 otherwise), respectively. Thus, the PCB mean structure is:

The MCEM algorithm was implemented as described in Section 3 and standard errors (SEs) were estimated using the nonparametric bootstrap method with 500 bootstrap samples (Efron and Tibshirani, 1993).

Table 1 shows parameter estimates and SEs of 4 random effects latent class models, with the number of latent classes being 2, 3, 4, and 5, respectively. The table also reports Bayesian information criterion (BIC) type IC_{Q} values (see Ibrahim *and others*, 2008 and the supplementary material available at *Biostatistics* online) for the 4 candidate models. The model selection criterion IC_{Q} chose the 3-class model as the best among the competing ones. This model divides participants into 3 disease classes according to their risks with respect to endometriosis: 71.40% of the participants belonged to the low-risk group, 25.27% to the middle-risk group, and 3.33% to the high-risk group. The endometriosis prevalences in the 3 groups were estimated as 30.42%, 53.16%, and 74.66%, respectively. Since *β*_{1} was restricted to be positive, we performed a one-sided Wald test of *H*_{0}:*β*_{1} = 0 versus *H*_{1}:*β*_{1} > 0 and obtained a *p* value of 0.0186. Tests of whether *α*_{1}≠0, *α*_{1} + *α*_{1,e}≠0, or *α*_{1} + *α*_{1,a}≠0 examine whether the mean biomarker values vary with latent risk class in the other, estrogenic, and antiestrogenic PCBs, respectively; these 3 tests were all highly significant (less than 0.0001). These tests indicate that the PCB concentration pattern is associated with the risk of endometriosis, and that the overall means of PCBs are significantly different between classes. The parameter estimates $\eta ^0=\u22120.8892$ and $\eta ^1=42.8010$ provide information on the probabilities of observing zero PCB concentration levels. The positive *η*_{1} estimate indicates that, for a given PCB and participant, the probability of observing a positive PCB level increases with the mean. We found a statistically significant relationship between the pattern of PCB concentrations and the risk of endometriosis using the proposed model. We now estimate features of the pattern in each of the risk groups.

Joint latent class models with | ||||

Two classes | Three classes | Four classes | Five classes | |

Estimates (SEs) | ||||

β_{0} | – 0.8052_{(0.3329)} | – 0.8272_{(0.3757)} | – 1.2187_{(0.3929)} | – 1.2587_{(0.4661)} |

β_{1} | 0.8401_{(0.4054)} | 0.9539_{(0.4579)} | 0.8342_{(0.3652)} | 0.6480_{(0.2779)} |

α_{0} | 0.0159_{(0.0024)} | 0.0152_{(0.0028)} | 0.0130_{(0.0023)} | 0.0117_{(0.0023)} |

α_{0, e} | 0.0090_{(0.0014)} | 0.0046_{(0.0008)} | 0.0042_{(0.0008)} | 0.0025_{(0.0005)} |

α_{0, a} | – 0.0043_{(0.0007)} | – 0.0048_{(0.0005)} | – 0.0053_{(0.0010)} | – 0.0054_{(0.0010)} |

α_{1} | 0.0289_{(0.0045)} | 0.0218_{(0.0042)} | 0.0156_{(0.0024)} | 0.0132_{(0.0026)} |

α_{1, e} | 0.0177_{(0.0020)} | 0.0075_{(0.0009)} | 0.0077_{(0.0013)} | 0.0079_{(0.0014)} |

α_{1, a} | – 0.0075_{(0.0010)} | – 0.0067_{(0.0012)} | – 0.0036_{(0.0006)} | – 0.0027_{(0.0003)} |

ξ | 0.7263_{(0.0860)} | 0.8361_{(0.1114)} | 0.8800_{(0.0845)} | 0.9013_{(0.1147)} |

σ_{0} | 0.0088_{(0.0011)} | 0.0063_{(0.0012)} | 0.0053_{(0.0010)} | 0.0048_{(0.0007)} |

σ_{1} | 0.0198_{(0.0033)} | 0.0127_{(0.0021)} | 0.0094_{(0.0015)} | 0.0087_{(0.0013)} |

ρ | 0.7788_{(0.0749)} | 0.7259_{(0.1195)} | 0.6599_{(0.1148)} | 0.6464_{(0.1240)} |

η_{0} | – 0.8556_{(0.0750)} | – 0.8892_{(0.1625)} | – 0.9025_{(0.1427)} | – 0.7844_{(0.1195)} |

η_{1} | 37.9655_{(3.5367)} | 42.8010_{(4.6698)} | 45.2033_{(4.4552)} | 45.6119_{(4.6333)} |

π_{0} | 0.6924_{(0.0541)} | 0.7140_{(0.0656)} | 0.3564_{(0.0454)} | 0.3050_{(0.0543)} |

π_{1} | 0.2527_{(0.0231)} | 0.4497_{(0.0968)} | 0.3435_{(0.0447)} | |

π_{2} | 0.1604_{(0.0245)} | 0.1877_{(0.0353)} | ||

π_{3} | 0.1189_{(0.0195)} | |||

Model selection | ||||

IC_{Q} | 578.1685 | 568.7670 | 587.4758 | 607.9914 |

Joint latent class models with | ||||

Two classes | Three classes | Four classes | Five classes | |

Estimates (SEs) | ||||

β_{0} | – 0.8052_{(0.3329)} | – 0.8272_{(0.3757)} | – 1.2187_{(0.3929)} | – 1.2587_{(0.4661)} |

β_{1} | 0.8401_{(0.4054)} | 0.9539_{(0.4579)} | 0.8342_{(0.3652)} | 0.6480_{(0.2779)} |

α_{0} | 0.0159_{(0.0024)} | 0.0152_{(0.0028)} | 0.0130_{(0.0023)} | 0.0117_{(0.0023)} |

α_{0, e} | 0.0090_{(0.0014)} | 0.0046_{(0.0008)} | 0.0042_{(0.0008)} | 0.0025_{(0.0005)} |

α_{0, a} | – 0.0043_{(0.0007)} | – 0.0048_{(0.0005)} | – 0.0053_{(0.0010)} | – 0.0054_{(0.0010)} |

α_{1} | 0.0289_{(0.0045)} | 0.0218_{(0.0042)} | 0.0156_{(0.0024)} | 0.0132_{(0.0026)} |

α_{1, e} | 0.0177_{(0.0020)} | 0.0075_{(0.0009)} | 0.0077_{(0.0013)} | 0.0079_{(0.0014)} |

α_{1, a} | – 0.0075_{(0.0010)} | – 0.0067_{(0.0012)} | – 0.0036_{(0.0006)} | – 0.0027_{(0.0003)} |

ξ | 0.7263_{(0.0860)} | 0.8361_{(0.1114)} | 0.8800_{(0.0845)} | 0.9013_{(0.1147)} |

σ_{0} | 0.0088_{(0.0011)} | 0.0063_{(0.0012)} | 0.0053_{(0.0010)} | 0.0048_{(0.0007)} |

σ_{1} | 0.0198_{(0.0033)} | 0.0127_{(0.0021)} | 0.0094_{(0.0015)} | 0.0087_{(0.0013)} |

ρ | 0.7788_{(0.0749)} | 0.7259_{(0.1195)} | 0.6599_{(0.1148)} | 0.6464_{(0.1240)} |

η_{0} | – 0.8556_{(0.0750)} | – 0.8892_{(0.1625)} | – 0.9025_{(0.1427)} | – 0.7844_{(0.1195)} |

η_{1} | 37.9655_{(3.5367)} | 42.8010_{(4.6698)} | 45.2033_{(4.4552)} | 45.6119_{(4.6333)} |

π_{0} | 0.6924_{(0.0541)} | 0.7140_{(0.0656)} | 0.3564_{(0.0454)} | 0.3050_{(0.0543)} |

π_{1} | 0.2527_{(0.0231)} | 0.4497_{(0.0968)} | 0.3435_{(0.0447)} | |

π_{2} | 0.1604_{(0.0245)} | 0.1877_{(0.0353)} | ||

π_{3} | 0.1189_{(0.0195)} | |||

Model selection | ||||

IC_{Q} | 578.1685 | 568.7670 | 587.4758 | 607.9914 |

The analysis is also performed under the assumption *L _{i}* = 0 for any

*i*; this amounts to assuming no association between PCBs and endometriosis. The parameter estimates are $\beta ^0=\u22120.5537$ (SE 0.2339), $\alpha ^0=0.0225$ (0.0035), $\alpha ^0,e=0.0172$ (0.0021), $\alpha ^0,a=\u22120.0113$ (0.0017), $\xi ^=0.6191$ (0.0586), $\sigma ^0=0.0231$ (0.0030), $\eta ^0=\u22120.5858$ (0.1217), $\eta ^1=35.5567$ (3.5701). The IC

*value is 612.4074.*

_{Q}To investigate the features of the individual PCB congener concentration, we computed empirical Bayes estimators of random effects $b~j0$ and $b~j1$, *j* = 1,2,…,62. The nonzero concentration levels for the *j*th PCB congener in the *k*th latent class can be estimated by replacing random effects with their empirical Bayes estimators and replacing unknown parameters with their estimates in (4.1). The estimated rate of observing zero concentration levels for the *j*th PCB can be similarly derived. Figures 4 shows, in the selected 3-class model, the nonzero concentration levels of each PCB congener. There is a positive relationship between PCB concentration mean and the risk class for all PCBs (i.e. the pattern in mean concentration for all PCBs is highest for high-risk participants and lowest for low-risk participants). This suggests that there is a combined contribution of all 62 PCB concentrations in assessing the risk of endometriosis. The differences in the PCB patterns from the low risk to the high risk are largest for estrogenic PCBs among the 3 PCB groups. Moreover, we see a reverse pattern for the probability of obtaining a zero PCB. The rate of zero varied across PCB concentrations and across classes (range 0–0.7).

The inferences for *β*_{1}, *α*_{1}, *α*_{1,e}, and *α*_{1,a} were similar across the different models (2- through 5-class models), suggesting that even though the data are best described by the 3-class model, important inferences are robust to model misspecification. Further, empirical Bayes estimators of concentration patterns in each of the risk groups (similar to Figure 4, but for 2-, 4-, 5-class models) showed similar results to those presented in Figure 4. Namely, concentration levels increased with an increasing risk group for all PCBs, and the increasing concentration levels were largest for estrogenic markers as compared to other markers (data not shown).

The joint latent class model provided strong evidence for an association between high-dimensional PCB exposure and the risk and endometriosis. The primary goal of the proposed latent class model is to provide a way to associate high-dimensional exposure data with the risk of endometriosis. The fact that the nature of the association (*β*_{1} in Table 1 and parameters that characterize PCB biomarker exposures) is similar across the models with different numbers of latent class provides reassurance about the robustness of important inference to the number of latent classes. That said, the evidence for 3 latent classes of risk does suggest that there may be 3 types of exposure patterns that are associated with the risk of endometriosis. The fact that overall patterns are higher in the high-risk group and are lower in the other groups suggests that what distinguishes the different groups are small differences in a majority of PCBs.

## SIMULATIONS

To evaluate the performance of the MCEM algorithm and the model selection criterion IC_{Q} for the joint latent class modeling approach, we conducted a simulation study by generating data similar to that in our application. We generated data according to a 2-, 3-, 4-, or 5-class model. The true parameter values were chosen to be similar to the estimated values of the corresponding models in our application example. Also, to create a scenario similar to the application, we used 79 participants and 62 biomarkers in each simulation. A total of 500 data sets were generated from each of the 4 latent class models. We assumed the number of latent classes was known and then fitted the model to the generated data. The results are reported in Table 2. For each model, the table presents the average and SD of the point estimates. We observed that the estimates were close to the true parameter values with reasonably small SDs. The parameters of most interest (e.g. *β*_{1}, *α*_{1}, *α*_{1,e}, and *α*_{1,a}) were nearly unbiased.

Truth | Average estimate | Estimate SD | Truth | Average estimate | Estimate SD | |

Two latent classes | Three latent classes | |||||

β_{0} | – 0.8052 | – 0.8097 | 0.3188 | – 0.8272 | – 0.8267 | 0.3823 |

β_{1} | 0.8401 | 0.8382 | 0.3912 | 0.9539 | 0.9504 | 0.3681 |

α_{0} | 0.0159 | 0.0159 | 0.0038 | 0.0152 | 0.0150 | 0.0035 |

α_{1} | 0.0090 | 0.0092 | 0.0029 | 0.0046 | 0.0045 | 0.0010 |

α_{0, a} | – 0.0043 | – 0.0043 | 0.0011 | – 0.0048 | – 0.0046 | 0.0014 |

α_{1, a} | 0.0289 | 0.0287 | 0.0068 | 0.0218 | 0.0218 | 0.0049 |

α_{0, e} | 0.0177 | 0.0178 | 0.0044 | 0.0075 | 0.0075 | 0.0020 |

α_{1, e} | – 0.0075 | – 0.0077 | 0.0048 | – 0.0067 | – 0.0069 | 0.0016 |

ξ | 0.7263 | 0.7214 | 0.1079 | 0.8361 | 0.8353 | 0.1525 |

σ_{1} | 0.0088 | 0.0086 | 0.0026 | 0.0063 | 0.0061 | 0.0019 |

σ_{2} | 0.0198 | 0.0198 | 0.0056 | 0.0127 | 0.0125 | 0.0024 |

ρ | 0.7788 | 0.7817 | 0.1324 | 0.7259 | 0.7231 | 0.1521 |

η_{0} | – 0.8556 | – 0.8548 | 0.1383 | – 0.8892 | – 0.8934 | 0.1475 |

η_{1} | 37.9655 | 38.1683 | 5.0883 | 42.8010 | 43.0890 | 5.3585 |

π_{0} | 0.6924 | 0.6914 | 0.0538 | 0.7140 | 0.7126 | 0.0837 |

π_{1} | 0.2527 | 0.2539 | 0.0376 | |||

Four latent classes | Five latent classes | |||||

β_{0} | – 1.2187 | – 1.2157 | 0.5768 | – 1.2587 | – 1.2604 | 0.6220 |

β_{1} | 0.8342 | 0.8363 | 0.3995 | 0.6480 | 0.6513 | 0.2804 |

α_{0} | 0.0130 | 0.0128 | 0.0032 | 0.0117 | 0.0120 | 0.0030 |

α_{1} | 0.0042 | 0.0043 | 0.0016 | 0.0025 | 0.0026 | 0.0009 |

α_{0, a} | – 0.0053 | – 0.0054 | 0.0017 | – 0.0054 | – 0.0055 | 0.0010 |

α_{1, a} | 0.0156 | 0.0157 | 0.0040 | 0.0132 | 0.0133 | 0.0029 |

α_{0, e} | 0.0077 | 0.0079 | 0.0026 | 0.0079 | 0.0077 | 0.0024 |

α_{1, e} | – 0.0036 | – 0.0038 | 0.0019 | – 0.0027 | – 0.0026 | 0.0007 |

ξ | 0.8800 | 0.8798 | 0.1348 | 0.9013 | 0.8971 | 0.1622 |

σ_{1} | 0.0053 | 0.0052 | 0.0016 | 0.0048 | 0.0044 | 0.0012 |

σ_{2} | 0.0094 | 0.0095 | 0.0019 | 0.0087 | 0.0086 | 0.0023 |

ρ | 0.6599 | 0.6651 | 0.1950 | 0.6464 | 0.6476 | 0.1260 |

η_{0} | – 0.9025 | – 0.9017 | 0.1487 | – 0.7844 | – 0.7837 | 0.1384 |

η_{1} | 45.2033 | 45.2051 | 7.3796 | 45.6119 | 45.3604 | 7.9095 |

π_{0} | 0.3564 | 0.3514 | 0.0535 | 0.3050 | 0.3014 | 0.0565 |

π_{1} | 0.4497 | 0.4490 | 0.0681 | 0.3435 | 0.3389 | 0.0763 |

π_{2} | 0.1604 | 0.1595 | 0.0835 | 0.1877 | 0.1908 | 0.0531 |

π_{3} | 0.1189 | 0.1179 | 0.0310 |

Truth | Average estimate | Estimate SD | Truth | Average estimate | Estimate SD | |

Two latent classes | Three latent classes | |||||

β_{0} | – 0.8052 | – 0.8097 | 0.3188 | – 0.8272 | – 0.8267 | 0.3823 |

β_{1} | 0.8401 | 0.8382 | 0.3912 | 0.9539 | 0.9504 | 0.3681 |

α_{0} | 0.0159 | 0.0159 | 0.0038 | 0.0152 | 0.0150 | 0.0035 |

α_{1} | 0.0090 | 0.0092 | 0.0029 | 0.0046 | 0.0045 | 0.0010 |

α_{0, a} | – 0.0043 | – 0.0043 | 0.0011 | – 0.0048 | – 0.0046 | 0.0014 |

α_{1, a} | 0.0289 | 0.0287 | 0.0068 | 0.0218 | 0.0218 | 0.0049 |

α_{0, e} | 0.0177 | 0.0178 | 0.0044 | 0.0075 | 0.0075 | 0.0020 |

α_{1, e} | – 0.0075 | – 0.0077 | 0.0048 | – 0.0067 | – 0.0069 | 0.0016 |

ξ | 0.7263 | 0.7214 | 0.1079 | 0.8361 | 0.8353 | 0.1525 |

σ_{1} | 0.0088 | 0.0086 | 0.0026 | 0.0063 | 0.0061 | 0.0019 |

σ_{2} | 0.0198 | 0.0198 | 0.0056 | 0.0127 | 0.0125 | 0.0024 |

ρ | 0.7788 | 0.7817 | 0.1324 | 0.7259 | 0.7231 | 0.1521 |

η_{0} | – 0.8556 | – 0.8548 | 0.1383 | – 0.8892 | – 0.8934 | 0.1475 |

η_{1} | 37.9655 | 38.1683 | 5.0883 | 42.8010 | 43.0890 | 5.3585 |

π_{0} | 0.6924 | 0.6914 | 0.0538 | 0.7140 | 0.7126 | 0.0837 |

π_{1} | 0.2527 | 0.2539 | 0.0376 | |||

Four latent classes | Five latent classes | |||||

β_{0} | – 1.2187 | – 1.2157 | 0.5768 | – 1.2587 | – 1.2604 | 0.6220 |

β_{1} | 0.8342 | 0.8363 | 0.3995 | 0.6480 | 0.6513 | 0.2804 |

α_{0} | 0.0130 | 0.0128 | 0.0032 | 0.0117 | 0.0120 | 0.0030 |

α_{1} | 0.0042 | 0.0043 | 0.0016 | 0.0025 | 0.0026 | 0.0009 |

α_{0, a} | – 0.0053 | – 0.0054 | 0.0017 | – 0.0054 | – 0.0055 | 0.0010 |

α_{1, a} | 0.0156 | 0.0157 | 0.0040 | 0.0132 | 0.0133 | 0.0029 |

α_{0, e} | 0.0077 | 0.0079 | 0.0026 | 0.0079 | 0.0077 | 0.0024 |

α_{1, e} | – 0.0036 | – 0.0038 | 0.0019 | – 0.0027 | – 0.0026 | 0.0007 |

ξ | 0.8800 | 0.8798 | 0.1348 | 0.9013 | 0.8971 | 0.1622 |

σ_{1} | 0.0053 | 0.0052 | 0.0016 | 0.0048 | 0.0044 | 0.0012 |

σ_{2} | 0.0094 | 0.0095 | 0.0019 | 0.0087 | 0.0086 | 0.0023 |

ρ | 0.6599 | 0.6651 | 0.1950 | 0.6464 | 0.6476 | 0.1260 |

η_{0} | – 0.9025 | – 0.9017 | 0.1487 | – 0.7844 | – 0.7837 | 0.1384 |

η_{1} | 45.2033 | 45.2051 | 7.3796 | 45.6119 | 45.3604 | 7.9095 |

π_{0} | 0.3564 | 0.3514 | 0.0535 | 0.3050 | 0.3014 | 0.0565 |

π_{1} | 0.4497 | 0.4490 | 0.0681 | 0.3435 | 0.3389 | 0.0763 |

π_{2} | 0.1604 | 0.1595 | 0.0835 | 0.1877 | 0.1908 | 0.0531 |

π_{3} | 0.1189 | 0.1179 | 0.0310 |

The simulation is also performed under the assumption that there is no association between PCBs and endometriosis. As in other simulation studies, the true parameter values of simulated model is chosen to be the parameter estimates in the application example (see the Footnote of Table 1). The average of parameter estimates of *β*_{0}, *α*_{0}, *α*_{0, e}, *α*_{0, a}, *ξ*, *σ*_{0}, *η*_{0}, and *η*_{1} are – 0.5563 (SE 0.2037), 0.0222 (0.0038), 0.0177 (0.0051), – 0.0113 (0.0021), 0.6214 (0.1535), 0.0230 (0.0039), – 0.5954 (0.1581), and 35.4520 (5.5767), respectively.

To examine the performance of model selection by IC_{Q}, we fit models having 2, 3, 4, and 5 latent classes to each of the generated data sets and used BIC-type IC_{Q} to select the best model. Supplementary Table 1 in the supplementary material available at *Biostatistics* online reports the frequencies of the selected models under each scenario. The correct selection rates are 92%, 81%, 79%, 77%, and 78%, respectively, for the no-association, 2-, 3-, 4-, and 5-class models. In addition, even when IC_{Q} selected the wrong models, the majority of the misselected models are the adjacent ones (one fewer or more class). Furthermore, these simulations showed that even under a misspecified model (e.g. true model was a 3-class model and 2-class model was selected), estimates of *β*_{1}, *α*_{1}, *α*_{1,e}, and *α*_{1,a} had little bias (data not shown).

## 6. DISCUSSION

This paper proposed a joint latent class modeling approach for predicting a binary event from high-dimensional biomarker data. In this approach, the relationship between complex biomarker expression patterns and disease risk is established by incorporating ordinal latent risk classes, which link the 2 modeling components. Further, complex marker-specific differences are characterized through marker-specific random effects. The approach is flexible in that it easily incorporates a varying number of ordinal latent classes which will depend on the application of interest.

The joint latent class model presented in this paper supplements other methods for analyzing disease-biomarker data, especially when high-dimensionality and collective effects occur. Alternative analysis strategies have been proposed for analyzing the endometriosis and PCB concentration data set. Buck Louis *and others* (2005), Gennings *and others* (2010), and Herring (2010) analyzed the data with logistic–regression-based models, focusing on assessing the effect of individual PCB exposures on endometriosis risk. In general, these methods found only a small number of associations between the large number of individual PCB exposures and the risk of endometriosis. Our approach is more flexible in that we can identify complex patterns between the high-dimensional exposures and disease prevalence using latent class structure. The works of Buck Louis *and others* (2005), Gennings *and others* (2010), and Herring (2010) may have advantages when only a few biomarkers have very strong association with risk. However, when there is a large proportion of exposure measurements and each contributes a small amount to the risk, their methods may fail. Our approach is well suited for this latter case, while also allowing for the case where a few individual PCBs may contribute to the risk.

The application of the approach to the epidemiology study on PCB concentrations and the risk of endometriosis revealed some interesting findings from a substantive view point. First, we identified a highly significant overall relationship between PCB concentrations and the risk of endometriosis, a finding which was not found with other techniques. Second, the model allowed us to reveal features of the relationship, which were not available with other methods. Particularly, we found an overall positive association of all biomarkers with the risk of endometriosis, suggesting that all PCBs, to different degrees, are positively associated with the risk of endometriosis. Further, we identified that estrogenic PCBs may be particularly important in determining the risk of endometriosis.

The methodology was developed for a binary measure of risk assessment, which was consistent with the application presented here. However, the model could be adapted to other generalized linear model outcomes (e.g. counts, continuous, exponential). Our focus was primarily on assessing the role of high-dimensional biomarkers on disease risk and not on evaluating disease prediction *per se*. Future research will focus on comparing features of model performance such as the receiver operating characteristic curve and other measures of diagnostic accuracy between our proposed modeling framework and more conventional methods such as logistic regression with model selection.

## SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

## FUNDING

Intramural Research Program of the National Institutes of Health; *Eunice Kennedy Shriver* National Institute of Child Health and Human Development; National Institute of Environmental Health Sciences (Grant 1R01ES09044-01).

We thank associate editor and reviewers for their thoughtful and constructive comments, which have led to an improved article. We thank Drs Germaine Louis and Brian Whitcomb for their helpful comments and suggestions. We also thank the Center for Information Technology, the National Institutes of Health, for providing access to the high performance computational capabilities of the Biowulf Linux cluster. *Conflict of Interest:* None declared.

## References

*Bayesian Data Analysis*.