Zhenke Wu, Maria Deloria-Knoll, Scott L. Zeger, Nested partially latent class models for dependent binary data; estimating disease etiology, Biostatistics, Volume 18, Issue 2, April 2017, Pages 200–213, https://doi.org/10.1093/biostatistics/kxw037
Abstract
The Pneumonia Etiology Research for Child Health (PERCH) study seeks to use modern measurement technology to infer the causes of pneumonia for which gold-standard evidence is unavailable. Based on case-control data, the article describes a latent variable model designed to infer the etiology distribution for the population of cases, and for an individual case given her measurements. We assume each observation is drawn from a mixture model for which each component represents one disease class. The model considered here addresses a major limitation of the traditional latent class approach by taking account of residual dependence among multivariate binary outcomes given disease class, hence reducing estimation bias, retaining efficiency and offering more valid inference. Such “local dependence” on each subject is induced in the model by nesting latent subclasses within each disease class. Measurement precision and covariation can be estimated using the control sample for whom the class is known. In a Bayesian framework, we use stick-breaking priors on the subclass indicators for model-averaged inference across different numbers of subclasses. Assessment of model fit and individual diagnosis are done using posterior samples drawn by Gibbs sampling. We demonstrate the utility of the method on simulated and on the motivating PERCH data.
1. INTRODUCTION
Clinicians routinely use measurements to differentially diagnose a patient's unknown disease etiology and then choose a treatment from among those available. More often than not, the differential diagnosis is a qualitative process based on judgment and experience. As clinical measurements become more precise and complex and as the number of possible known etiologies grows, such qualitative processes are less likely to be optimal. An important question therefore is whether formal probabilistic calculations can improve clinical decisions when the relevant information is quantitative. For example, in the Pneumonia Etiology Research for Child Health (PERCH) study of childhood pneumonia (Levine and others, 2012), a vector of presence/absence indicators for a large number of pathogens is measured on each child by polymerase chain reaction (PCR) using specimens from the nasopharyngeal (NP) cavity. A clinical goal is to use the multivariate binary response to infer the pathogen in the child's lung causing pneumonia.
In addition, public health researchers are interested in estimating the population fraction of cases caused by each pathogen, referred to as the etiologic fractions or population etiology distribution (Feikin and others, 2014). Knowledge of the etiology distribution is essential for planning prevention and treatment programs. Because the lung cannot be directly sampled, except in cases of critical illness, imperfect measurements from the periphery are used to infer the latent state of the disease.
PERCH intends to infer for an individual case her latent lung infection status (the latent state) by collecting multivariate binary measurements from the periphery. The joint distribution of the latent state and the measurements is characterized by the true- and false-positive rates and the distribution of the latent disease-causing infection. Covariates such as age and HIV status can also influence the chance of each pathogen causing her disease.
In general terms, the PERCH scientific questions require inference about latent random variables. The same is true for many other problems, for example, biomarkers for disease diagnosis (e.g., Jokinen and Scott, 2010), words for learning topics of a text (e.g., Hofmann, 2001), and questionnaire items for evaluating severity of depression (e.g., Kroenke and Spitzer, 2002). One way of classifying latent variable models is by the discrete or continuous nature of their latent and manifest (observed) variables. Among them, “latent class” models (LCM) for discrete latent and discrete manifest variables have been developed and widely applied since the 1950s (e.g., Lazarsfeld, 1950; Goodman, 1974).
LCMs constitute a family of distributions for correlated discrete measurements. The conventional LCM generally makes the local independence (LI) assumption that observed variables are independent of one another given the latent class (e.g., Lord, 1952). In the multivariate binary case, an individual's measurement vector is thus linked to her latent class by a simple product likelihood whose unknown measurement parameters are sensitivities and specificities when the latent class denotes the presence or absence of disease. We then obtain the observed likelihood by summing over all possible values of the latent class, weighting each by the corresponding entry of a vector of unknown mixing weights. The LI assumption implies that the latent membership completely explains the marginal dependence among the measurements. Under local identifiability conditions (Jones and others, 2010), we can then estimate the measurement parameters and mixing weights by maximum likelihood to find the values that optimally reduce the observed dependence among measurements given latent class, e.g., through the expectation-maximization (EM) algorithm. Individual classification can then proceed by applying Bayes rule using the estimated parameters.
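As a minimal sketch of the locally independent LCM just described, the following Python snippet evaluates the product likelihood for one binary pattern, sums over classes to obtain the observed likelihood, and applies Bayes rule to classify. The function names and all numeric values are hypothetical and chosen only for illustration; they are not taken from the article.

```python
from math import prod


def class_likelihood(m, rates):
    """P(m | class) under local independence: a product of Bernoulli terms."""
    return prod(r if x == 1 else 1.0 - r for x, r in zip(m, rates))


def posterior_class_probs(m, weights, rates_by_class):
    """Bayes rule: P(class | m) is proportional to weight * P(m | class)."""
    joint = [w * class_likelihood(m, r) for w, r in zip(weights, rates_by_class)]
    total = sum(joint)  # observed (marginal) likelihood of pattern m
    return [j / total for j in joint], total


# Hypothetical two-class example with three binary tests.
weights = [0.3, 0.7]                  # mixing weights (assumed values)
rates_by_class = [[0.9, 0.8, 0.7],    # positive rates in class 0 (assumed)
                  [0.1, 0.2, 0.1]]    # positive rates in class 1 (assumed)
post, marginal = posterior_class_probs([1, 1, 0], weights, rates_by_class)
```

With these assumed values the pattern (1, 1, 0) is assigned mostly to class 0, because two of its three positive results are far more likely under that class's rates.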
Motivated by settings in which classes are observed for some subjects, such as the known infection status of controls, Wu and others (2016) introduced a “partially latent” class model (pLCM). The control sample provides the requisite information to estimate the specificities of the measurements. In the original formulation, they assumed LI for the multivariate binary measurements within each class. However, within cases or controls, several pairs of pathogens had observed log odds ratios that are inconsistent with their model-based predictive distributions based on LI. To address this lack of fit in the covariances, one approach is to extend pLCM by introducing dependence among measurements for persons within the same class. These associations have scientific value in their own right, for example, in studying patterns of pathogen–pathogen stimulation or inhibition.
Deviations from LI, or “local dependence” (LD), can occur in many applications, for example, in medical diagnostic tests when the most severely diseased patients and the healthiest patients are easiest to classify correctly (Albert and others, 2001), or when tests target similar genetic molecules (Qu and Hadgu, 1998). Many authors have noted that not accounting for LD can bias estimates of model parameters (e.g., Pepe and Janes, 2007). Therefore, in many applications where the LI model is assumed, authors use model diagnostics to ensure valid model-based conclusions (e.g., Garrett and Zeger, 2000; Wu and others, 2016).
Ideas for relaxing LI can be distinguished by whether or not extra latent variables are introduced to induce correlation among the measurements. Without doing so, for example, Harper (1972) modeled associations between pairs, triples, and higher order combinations of the measurements given a latent class; Haberman (1979) used log-linear models for the measurement vector to extend LCM, viewing the latent class indicator as one of the category variables.
An alternative approach allows for dependence by using extra latent variables of continuous or discrete types or a mixture. For example, Qu and Hadgu (1998) used Gaussian random intercepts to induce within-subject symmetric and positive correlations among multiple diagnostic tests. Albert and others (2001) proposed to nest one extra unobserved subclass within each of two latent classes (diseased or non-diseased) to represent subjects measured without error. Dendukuri and others (2009) hierarchically layered extra mixed latent variables in a Bayesian framework. Adding extra latent variables can account for LD because any multivariate discrete distribution can be represented by a locally independent LCM with sufficiently many latent classes (Dunson and Xing, 2009, Corollary 1). However, when a satisfactory fit requires many classes, especially with high dimensions of manifest variables, interpreting inferred classes remains a difficult task.
The scientific background for this work points us toward the second strategy. LD could arise from multiple sources given the disease class. First, a pair of tests could be positively correlated if they cross-react for their respective targets, e.g., the probe for pathogen A will sometimes detect pathogen B, and vice versa. The PERCH study has carefully chosen the PCR targets to minimize cross-reactivity. For example, Bordetella parapertussis, a sister pathogen of B. pertussis, has not been included to avoid their cross-reactions. However, extra latent variables within each disease class can model such cross-reactivity if present. Second, correlations among tests can be induced by unobserved heterogeneity among subjects in the propensity for colonizing pathogens in their nasal cavities. We can also represent it empirically by a second level of latent variable within each disease class. With control data, we can estimate one or both sources of dependence (Section 2.2). Third, given a case's disease class, pathogen interactions (mutual stimulation or inhibition) will induce test dependence that would be difficult to distinguish from the prior sources.
In this article, we develop a novel latent variable model for multivariate binary data obtained from a case-control study. Using control data with a known class and assuming the covariation among control measurements is partly shared among the other latent classes for cases, we extend the traditional latent class approach to avoid the LI assumption. The proposed model is a natural extension of pLCM (Wu and others, 2016) and can be used to test its LI assumption.
We assume each child's measurements comprise an observation from a mixture model with component classes that represent the different pathogens that can cause her pneumonia. One primary goal of analysis is to estimate the probability distribution for these classes. To allow for LD, we introduce latent subclasses nested within each of the disease classes (the case classes plus one control class). Measurements within a subclass are assumed independent. We refer to the model as a “nested partially latent class model” or npLCM and use a prior to encourage small but variable numbers of subclasses that parsimoniously approximate the multivariate discrete dependence and avoid overfitting (Section 2.5).
We show that the proposed model is partially identifiable (Gustafson, 2015) and incorporate prior knowledge about measurement sensitivities to facilitate Bayesian estimation of the etiologic fractions. The npLCM is estimated via Markov chain Monte Carlo (MCMC) with designed precision to approximate the posterior distributions of the population etiologic fractions, individual latent state, as well as functions of them, such as the fraction of pneumonia cases caused by bacteria.
In Section 2, we formulate our model and discuss its statistical properties. Section 3 provides details on the posterior sampling algorithm to draw inference based on our model. Section 4 illustrates through asymptotic evaluations and finite-sample simulations the benefits of the new model relative to a version that ignores LD. Section 5 applies the proposed method to PERCH study data. Section 6 concludes with remarks on the method's advantages, limitations, and future extensions.
2. NESTED PARTIALLY LATENT CLASS MODEL
In this section, we specify the nested partially latent class model (npLCM) and consider its statistical properties using the PERCH study example to make the ideas concrete. Let comprise a -dimensional multivariate binary measurement collected for subjects , where the first subjects are cases and the remaining are controls. Let denote a case and denote a control.
2.1. Measurement likelihood
Figure caption: Borrowing measurement characteristics from controls to cases using subclasses for each disease class. Five pathogens (A–E) are measured in this example. The figure marks the latent state or disease class, the multivariate binary measurements, and the true-positive rates (in shaded boxes) and false-positive rates (in blank dashed boxes).
Throughout the article, we rely on the scientific assumption that each child's pneumonia is caused by a single primary pathogen. The more general case where disease can be attributed to multiple pathogens is a natural extension (Section 6).
2.2. Control likelihood
The control measurement distribution is assumed to take the form in Goodman (1974). Mutual dependence is induced by the existence of multiple subclasses, with each subclass having possibly distinct positive rate profiles. Given an unobserved subclass, measurements are assumed to be mutually independent. Marginalizing over the latent subclasses produces dependence for pathogens with different rates across subclasses. The formulation is natural for PERCH given the heterogeneity in the health status of controls. For example, the subclasses can represent the subjects' strength of immunity that could affect the rates of pathogen detection.
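To make the mechanism concrete, here is a small sketch showing how marginalizing over subclasses turns within-subclass independence into marginal dependence. It computes the pairwise odds ratio of two measurements from subclass weights and subclass-specific positive rates. The function names and numbers are hypothetical and are not values from the PERCH analysis.

```python
def pair_probs(weights, rates, a, b):
    """Joint probabilities of measurements a and b after marginalizing subclasses.

    weights[k] is the weight of subclass k; rates[k][j] is the positive rate of
    measurement j in subclass k. Within a subclass, measurements are independent.
    """
    p11 = sum(w * r[a] * r[b] for w, r in zip(weights, rates))
    p10 = sum(w * r[a] * (1 - r[b]) for w, r in zip(weights, rates))
    p01 = sum(w * (1 - r[a]) * r[b] for w, r in zip(weights, rates))
    p00 = sum(w * (1 - r[a]) * (1 - r[b]) for w, r in zip(weights, rates))
    return p11, p10, p01, p00


def odds_ratio(weights, rates, a, b):
    """Marginal pairwise odds ratio implied by the subclass mixture."""
    p11, p10, p01, p00 = pair_probs(weights, rates, a, b)
    return (p11 * p00) / (p10 * p01)


# One subclass: the odds ratio is 1 (independence).
or_one_subclass = odds_ratio([1.0], [[0.3, 0.2]], 0, 1)
# Two subclasses whose rates differ for both measurements: odds ratio above 1.
or_two_subclasses = odds_ratio([0.5, 0.5], [[0.6, 0.5], [0.1, 0.1]], 0, 1)
```

If only one of the two measurements had rates that differ across subclasses, the marginal odds ratio would remain 1, which matches the statement that dependence arises for pathogens with different rates across subclasses.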
2.3. Case likelihood
We can reformulate (2.2) as a three-stage generative process, similar to that for controls, using indicators of the case disease classes and the nested subclasses: the disease class is drawn from a categorical distribution, the subclass is then drawn from a categorical distribution, and finally each measurement is drawn from a Bernoulli distribution, independently across dimensions. At the first stage, the vector of etiologic fractions comprises the probabilities of a case belonging to each disease class and is the primary target of inference in this article. Then, the cases' subclass mixing weights determine the probability of a case falling into each subclass. The final stage generates the measurement at each dimension: it is positive with the true-positive or the false-positive probability, according to the realized values of the disease class and subclass in the previous steps. Because the former is the probability of true detection for infections caused by the corresponding pathogen, we term it the true positive rate (TPR) and collect the TPRs by subclass.
Importantly, the subclass mixing weights of cases and controls need not be identical. This admits different measurement dependence structures for cases than controls, which could arise, for example, if stronger pathogen interactions appear in cases' NP cavity due to presence of the lung infection. We refer to the special case in which the two sets of weights are equal element-wise as non-interference submodels, under which controls and cases of a given class have identical distributions of the leave-one-dimension-out measurement vector. Restricting the model to a single subclass, for example, gives the pLCM.
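The three-stage case process can be sketched as follows. This is an illustrative reading of the verbal description, not the article's equation (2.2): it assumes that subclass weights are shared across case classes, that the dimension matching the causative pathogen uses the subclass TPR, and that every other dimension uses the subclass FPR. The names `pi`, `eta`, `theta`, and `psi` and all values are hypothetical.

```python
from itertools import product
from math import prod


def case_pattern_prob(m, pi, eta, theta, psi):
    """P(M = m) for a case under a nested partially latent class sketch.

    pi[j]: etiologic fraction of class j; eta[k]: case subclass weight;
    theta[k][j]: true-positive rate and psi[k][j]: false-positive rate of
    measurement j in subclass k.
    """
    total = 0.0
    for j, pi_j in enumerate(pi):            # stage 1: disease class
        for k, eta_k in enumerate(eta):      # stage 2: nested subclass
            # stage 3: dimension j uses the TPR, all other dimensions the FPR
            rates = [theta[k][l] if l == j else psi[k][l] for l in range(len(m))]
            total += pi_j * eta_k * prod(
                r if x == 1 else 1.0 - r for x, r in zip(m, rates))
    return total


# Hypothetical values: 3 pathogens, 2 subclasses.
pi = [0.5, 0.3, 0.2]
eta = [0.6, 0.4]
theta = [[0.9, 0.8, 0.85], [0.7, 0.9, 0.8]]
psi = [[0.3, 0.1, 0.2], [0.05, 0.4, 0.1]]
total_mass = sum(case_pattern_prob(m, pi, eta, theta, psi)
                 for m in product([0, 1], repeat=3))
```

Summing over all binary patterns should return 1, which is a quick sanity check on any implementation of this kind. Passing a single subclass (`eta = [1.0]`) reduces the sketch to a locally independent model.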
We have assumed that cases' latent state categories take values from a complete list of measured pathogens. The case likelihood (2.2) can be extended to account for other causes by adding an extra term weighted by the total etiology fraction of other causes. For a clinically confirmed pneumonia case, negative responses on pathogens by highly sensitive assays indicate the possibility of other etiologic pathogens.
Combining (2.1) and (2.2), the joint likelihood of all the data is the product of the case and control likelihoods across independent subjects.
2.4. Properties
The proposed model extends pLCM in Wu and others (2016) by adding additional parameters compared to the original formulation with the total number of parameters linear in when providing a parsimonious approximation to the case and control joint distributions that require parameters in a saturated model. We further reduce the effective number of parameters using a stick-breaking prior (Section 2.5).
We assumed that the LD of measurements within each case class can be explained by allowing the same number of LI subclasses as in the controls, so that the case subclass measurement parameters can be partly informed by their control counterparts (Stage 3 of case data generating process). Additional case subclasses can be included once the disease class is directly observed for some cases.
In Appendix S1 of supplementary material available at Biostatistics online, we provide expressions of the marginal means and pairwise associations for multivariate binary measurements given the npLCM likelihood. These formulas are used to study the magnitude of dependence given true parameters and to generate marginal posterior distributions for observables used in model checking, as illustrated in Sections 4.1 and 5.
2.5. Prior specifications
In Appendix S2.1 of supplementary material available at Biostatistics online, we specify the priors for the unknowns in npLCM. Given our primary interest in the etiologic fractions, the dependence structures within each disease class are nuisance parameters. Appendix S2.2 of supplementary material available at Biostatistics online discusses the use of a stick-breaking prior to encourage random small numbers of subclasses, which prevents model overfitting in finite samples by approximating the dependence structure parsimoniously. The specified priors are conjugate to the likelihood of unknown parameters, which makes the Gibbs sampler in Section 3 convenient to construct.
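The exact priors are given in the supplementary appendices and are not reproduced here. As a generic illustration of how a truncated stick-breaking construction produces a few dominant subclass weights, the sketch below draws each stick proportion from a Beta(1, alpha) distribution, which is the standard choice and an assumption here, and gives the last subclass the remaining mass. The function name and parameter values are hypothetical.

```python
import random


def stick_breaking_weights(num_subclasses, alpha, rng):
    """Draw subclass weights from a truncated stick-breaking construction."""
    weights, remaining = [], 1.0
    for k in range(num_subclasses):
        # The last stick takes whatever is left so the weights sum to one.
        v = 1.0 if k == num_subclasses - 1 else rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights


rng = random.Random(0)  # fixed seed for reproducibility
weights = stick_breaking_weights(num_subclasses=5, alpha=0.5, rng=rng)
```

Smaller values of `alpha` tend to concentrate mass on the first few sticks, which is how a prior of this type favors a small effective number of subclasses even when the truncation level is larger.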
3. POSTERIOR COMPUTATIONS
The posterior distributions of the population etiology fraction vector, the TPRs, and the FPRs can be estimated by simulating approximating samples from the joint posterior via MCMC algorithms. Figure S1 of supplementary material available at Biostatistics online presents the directed acyclic graph (DAG) for the model structure. Appendix S3 of supplementary material available at Biostatistics online details the sampling algorithms. All model estimations are performed by the R package “baker” (https://github.com/zhenkewu/baker).
4. ASYMPTOTIC AND SIMULATION STUDIES OF NESTED PARTIALLY LATENT CLASS MODELS
This section presents asymptotic and simulation studies to show that for cases like PERCH (i) when the LI assumption is incorrect, a working LI model will estimate the etiologic fractions with asymptotic bias; (ii) fitting the LD model to data generated with LI does not lose too much efficiency using sparse priors on subclass indicators; and (iii) compared to the LI model, the LD model produces 95% credible intervals for the etiologic fractions with better actual coverage rates.
4.1. Asymptotic bias evaluations
We first evaluate the asymptotic bias of the maximum likelihood estimator (MLE) for obtained from the working LI model (pLCM) using data generated by npLCM. Let be the true etiologic fractions, FPRs and TPRs. Let be the data, where is the total number of cases and controls. Fewer parameters fully specify pLCM: given disease class , the marginal TPRs and FPRs are functions of defined by and , , respectively. We fix at the true value , to eliminate the partial-identifiability issue and to focus on asymptotic bias evaluations. We then estimate the etiologic fractions , as well as . In this case, with large sample sizes, it must be expected that the Bayes estimate will behave in a similar way to the MLE. We study the performance of the Bayes estimates in Section 4.2 when the TPRs are not fixed.
Under LI, let be the MLE for the etiology fractions, where the last element equals . Let be the MLE for the marginal FPRs. Collected into one vector, jointly converges to , possibly different from the truth .
We obtain the limit by minimizing the Kullback–Leibler information criterion or, equivalently, by solving the corresponding estimating equation. The equation is a weighted average for cases and controls, with weights determined by their sample fractions in the limit as the sample size grows. The expectation is taken with respect to the true data generating distribution, and the likelihood is the pLCM likelihood (Wu and others, 2016). We use Monte Carlo samples from the true distribution to evaluate the expectation and the limit above and then numerically solve for its root. Our calculation assumed equal case and control sample sizes and could easily be modified for other sampling ratios.
We also characterize the true uncertainty of the MLE obtained from a possibly mis-specified working model. White (1982) established its asymptotic normality and provided the exact form of the asymptotic variances. Applied to our investigation here, the estimator satisfies where , and . We compute the robust variance of defined by as follows: (i) plug the obtained above into the first and second partial derivatives and (ii) approximate A and B using Monte Carlo samples.
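For reference, the general form of White's (1982) result for a maximum likelihood estimator under a possibly mis-specified working model is sketched below. The notation is assumed for illustration and may differ from the article's: $\gamma$ denotes the working-model parameters, $\gamma^{*}$ the Kullback–Leibler minimizer, $f$ the working-model density, $N$ the sample size, and expectations are taken under the true distribution (in the case-control setting, as the weighted average over cases and controls described in the text).

```latex
\sqrt{N}\,\bigl(\hat{\gamma} - \gamma^{*}\bigr)
  \;\xrightarrow{d}\;
  \mathcal{N}\!\bigl(0,\; A^{-1} B A^{-1}\bigr),
\qquad
A = \mathrm{E}\!\left[
      \frac{\partial^{2} \log f(M;\gamma)}{\partial\gamma\,\partial\gamma^{\top}}
    \right]_{\gamma=\gamma^{*}},
\quad
B = \mathrm{E}\!\left[
      \frac{\partial \log f(M;\gamma)}{\partial\gamma}\,
      \frac{\partial \log f(M;\gamma)}{\partial\gamma^{\top}}
    \right]_{\gamma=\gamma^{*}}.
```

When the working model is correctly specified, $A = -B$ and the sandwich form $A^{-1} B A^{-1}$ reduces to the usual inverse Fisher information, so comparing the two variances is one way to gauge the effect of mis-specification.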
The strength of LD given disease class determines the estimation bias. When the true data generating mechanism is close to independence, the working LI model estimates of the etiologic fractions are close to being asymptotically unbiased. To illustrate, we quantify the asymptotic bias for five binary measures (pathogens A to E). We generate Monte Carlo samples from the true data generating mechanisms with varying degrees of LD, while fixing the etiologic fraction to mimic what is seen in PERCH. We create associations among measurements by defining two subclasses for each of the 6 disease states (controls plus 5 disease classes for cases). We consider two scenarios of measurement parameters: little (I) and substantial (II) LD, that is, small versus large between-subclass differences in positive rates (see Appendix S5.1 of supplementary material available at Biostatistics online).
Figure 2. Scenarios I and II. Top (a): the true data generating mechanism, summarized by pairwise odds ratios for cases (upper right, solid lines) and controls (lower left, solid lines) as the cases' first subclass weight increases from 0 to 1. The pairwise odds ratios within each case class are shown by non-solid lines (legend at bottom). Pairwise independence is represented by the dotted horizontal lines for reference. The correlations of C with others are highlighted in shaded cells. Bottom (b): PRAB for estimating etiology fractions using the working LI model when the truth varies across a range of LD settings parametrized by the cases' first subclass weight.
Row (b) of Figure 2 shows the Percent Relative Asymptotic Bias (PRAB) for each etiologic fraction at all values of the cases' first subclass weight. The working LI model produces PRABs less than 13% in magnitude in Scenario I. Given small asymptotic biases, we also obtain good estimates of precision produced by the working LI model, with the ratios for model-based variance versus the robust variance between and for A–E. The two variances are mathematically identical at arbitrary parameter values if the marginal FPRs are known.
The asymptotic bias is large under strong LD. For example, in Scenario II, the working LI model overestimates with 121.3% relative bias at for its failure to account for the strong control LD. When the case LD is more similar to controls at , the PRAB is 40.5%. This is because the measurement on C is negatively associated with the measurements on B, D, or E given disease class B, D, or E, i.e. mutual inhibition (see shaded cells in Figure 2, a–II), leading to the case pattern observed twice as frequently as expected by an LI model. When they are further assigned with the highest likelihood to cause C under the working LI model, the upward bias results.
4.2. Bayesian fitting in finite samples
Appendix S5.2 of supplementary material available at Biostatistics online presents extensive finite-sample simulations to show that npLCM has much smaller biases in estimating the etiologic fractions under strong LD and negligible biases under weak LD. When the truth is close to LI, the npLCM is comparably efficient to pLCM for almost all settings. It also produces 95% credible intervals (CI) with near-nominal empirical coverage rates.
5. ANALYSIS OF PERCH DATA
The Pneumonia Etiology Research for Child Health (PERCH) study is a case-control study with 4000 patients hospitalized for severe or very severe pneumonia and over 5000 controls aged 1–59 months selected randomly from the community, frequency-matched on age in each month. Its objective is to evaluate etiologic agents causing severe and very severe pneumonia among hospitalized children in seven low and middle income countries with a significant burden of childhood pneumonia (Levine and others, 2012). PERCH will enable estimation of the population fraction of cases caused by each pathogen (Feikin and others, 2014), which is essential for planning prevention and treatment programs. Because the lung cannot be directly sampled, except in cases of critical illness, imperfect measurements from the periphery are used to infer the latent state of the disease for each case; these cases collectively comprise the population. More details about the PERCH design and objectives can be found in Deloria-Knoll and others (2012).
Using preliminary PERCH data from one site, we focus on PCR assays on NP specimens for cases and controls. We illustrate the advantage of the npLCM in accounting for measurement LD, with improved efficiency, better empirical fit, and more valid etiology estimation. Results for all seven countries will be reported elsewhere upon study completion. Included in the current illustrative analysis are NPPCR data for 592 cases and 613 controls on 6 species of pathogens (abbreviations and full names in Appendix S6 of supplementary material available at Biostatistics online).
We have compared the population etiology fractions estimated separately by two methods: the pLCM and the npLCM with a truncated number of subclasses. The npLCM results are similar when larger truncation levels are used. Note that the MCMC algorithm always assigns non-zero weights to all the subclasses, but most weights are almost always negligible in our analyses. As discussed in Section 2.5, we need expert prior knowledge on the sensitivities for posterior inference by both methods; we used sensitivity priors elicited from laboratory experts. Given our focus on 6 leading pathogens, we include the “other” cause for completeness as discussed in Section 2.3.
Strong LD is present in the analyzed data, with statistically significant log odds ratios observed for 6 out of 30 pathogen pairs among cases and controls, ranging from to , and also by noting that under LI assumption we expect such pairs. In addition, as noted in Berger and Sellke (1987) and Dunson and Xing (2009), the interval null hypothesis , is useful for detecting deviations from the point null of exact LI for cases. We choose based on experience in simulation studies and to permit deviations from LI so small as to be non-significant in our application. The largest subclass weight is estimated with 95% CI for the cases and for the controls, again suggesting non-negligible LD in the data.
Figure 3. Top: comparison of the posterior distributions of the etiology fractions between the pLCM (left) and npLCM (right); the numbers above are the posterior means. Bottom: PPD for the 10 most frequent multivariate binary patterns, separately for cases (left panel) and controls (right panel). The observed frequencies are overlaid as short segments across pairs of box-and-whiskers; the means of the PPDs are shown above them in actual numbers.
Figure caption: Individual disease etiology predictive distributions. Here four NPPCR data patterns are represented by the binary codes at the top (no measurements on “other” causes, hence left as “-”), each with its observed frequency marked beneath. The height of a bar represents the probability of a case being caused by each of the seven causes labelled on the horizontal axis. For each cause, paired bars compare the estimates from the pLCM (left) and the npLCM (right). Extra predictions are in Figure S4.
The npLCM also provides a better empirical fit. We have compared the posterior predictive distributions (PPD) (Gelman and others, 1996) of the frequencies of common NP measurement patterns to the observed values separately in the cases and the controls. Among cases (left panel in Figure 3(b)), for example, the npLCM adequately predicts the observed frequencies of the 2nd and 6th most common case patterns (000001: 12.5%; 000100: 5.4%) by accounting for the negative associations of RSV with other pathogens with the log odds ratios ranging from −3.37 to −0.12 (3 out of 5 statistically significant).
We also examine the pairwise associations by calculating the standardized LOR difference (SLORD) defined to be the observed LOR for a pair of measurements minus the mean LOR for the predictive distribution value from each method divided by the standard deviation of the LOR predictive distribution. Figure S3 of supplementary material available at Biostatistics online shows nine pairs of pathogens that have statistically significant deviations of model predicted LORs from the observed ones for the pLCM and only three pairs for the npLCM. A blank cell indicates a good model prediction for the observed pairwise LOR (|SLORD| < 2). The npLCM achieves a better fit by noting that, for a well-fitting model, we expect non-blank cells. The associations between pairs of measurements (HMPV-A/B,RSV) and (PARA-1,RSV) are not expected in either model, although npLCM does better. In the PERCH study, we observed that seasonal variation in the rate of detection for RSV, HMPV-A/B and PARA-1 were out of phase and seasonal regression adjustment, discussed elsewhere, can sensibly account for this negative association.
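The SLORD definition above translates directly into a short computation. The sketch below takes an observed log odds ratio and a list of log odds ratios drawn from a model's predictive distribution; the function name, the use of the sample standard deviation, and the numbers are illustrative assumptions rather than details from the article.

```python
from statistics import mean, stdev


def slord(observed_lor, predictive_lors):
    """Standardized log odds ratio difference for one pair of measurements.

    (observed LOR - mean of predictive LORs) / standard deviation of
    predictive LORs, using the sample standard deviation.
    """
    return (observed_lor - mean(predictive_lors)) / stdev(predictive_lors)


# Hypothetical predictive draws centered near zero and a strongly negative
# observed LOR, which would be flagged as poorly predicted (|SLORD| >= 2).
draws = [-0.1, 0.0, 0.1, 0.2, -0.2]
value = slord(-1.2, draws)
flagged = abs(value) >= 2
```

In practice the predictive draws would come from posterior predictive samples of each fitted model, so the same observed LOR can be flagged under one model and not under another.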
6. DISCUSSION
In this article, we derived and tested a nested pLCM to allow for local dependence among binary observations given class membership. We compared this new model with a special case that depends on local independence in terms of asymptotic and finite-sample properties. The npLCM reduces large-sample estimation bias, retains the estimation efficiency and gives more valid inferences about the etiologic fractions than the pLCM. The npLCM family also makes it possible to study the sensitivity of scientific findings to the LI assumption when pLCM is used.
The model first approximates the probability distribution for the control measurements by a mixture of product Bernoulli distributions with mixing weights encouraged towards a mixture with few components. The estimated control dependence structure is then applied to the case model with modifications that represent the influence of the latent disease state. This valuable information from controls may help distinguish competing models for the local dependence among measurements and warrants further studies (e.g. Albert and others, 2001).
In the analysis of six leading pathogens from the PERCH study, RSV is estimated to be the most prevalent infectious cause of childhood pneumonia except the “other” category. That evidence is robust to the LD assumption. Accounting for LD structure leads to notable increases in the etiologic fraction estimates of two pathogens and a decrease in another. The npLCM can also integrate extra measurements of better quality, for example, blood culture tests for bacteria that have near-perfect specificities, to inform TPRs and improve efficiency (Hammitt and others, 2012).
In this article, we assumed a single primary cause for each pneumonia case in the npLCM. This framework can be extended from a single to multiple causes by using a latent vector for case , , where indicates pathogen is a component cause. For example, Hoff (2005) used Dirichlet process mixture models to identify multiple abnormal genomic locations that are jointly responsible for each case's disease, but using case-only data with LI assumption. Alternatively, one can place an exponential decaying prior on the number of causes, or use conditionally specified models to characterize the interactions among pathogens (Besag, 1974), where is a vector of covariates predictive for pathogen being a cause in case . The computational cost to fit these models increases substantially because the search space for the latent vector expands exponentially in . Development of efficient and reliable posterior sampling algorithms can allow investigators to assess the evidence of multiple-pathogen etiologies as more measurements accrue.
A second extension of the npLCM family motivated by PERCH is to allow the etiology distribution and false positive rates to depend upon covariates such as season, the child's age and HIV status. Regression versions for npLCM have been implemented and are the subject of current study.
A critical assumption on which the model depends is that the source of within-class associations is similar for cases and controls. If the sources of correlations are substantially different for cases than controls, it would impair the proposed model's capacity to draw valid inferences.
Finally, Wu and others (2016) derived the pLCM to be used with a combination of direct measurements of cases' lungs without error and peripheral measures of cases and controls with error. With gold-standard data, this analysis is an example of supervised learning. The npLCM can be used in the same way. In the PERCH application, we rely entirely on peripheral samples, so the analysis is largely unsupervised. Robustness of inferences to model assumptions is critical.
SUPPLEMENTARY MATERIAL
Supplementary material is available at http://biostatistics.oxfordjournals.org.
ACKNOWLEDGMENTS
We thank the members of the PERCH Study and Expert Groups for discussions that helped shape our statistical approach, and the study participants. Research reported in this work was also partially funded through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1408-20318). (See Supplementary materials available at Biostatistics online for full acknowledgments.) Conflict of Interest: None declared.




