A model-based approach to Bayesian classification with applications to predicting pregnancy outcomes from longitudinal β-hCG profiles

This paper discusses Bayesian statistical methods for the classification of observations into two or more groups based on hierarchical models for nonlinear longitudinal profiles. Parameter estimation for a discriminant model that classifies individuals into distinct predefined groups or populations uses appropriate posterior simulation schemes. The methods are illustrated with data from a study involving 173 pregnant women. The main objective in this study is to predict normal versus abnormal pregnancy outcomes from beta human chorionic gonadotropin data available at early stages of pregnancy.


INTRODUCTION
In different biomedical situations, markers are needed for early detection of the onset of a specific disease, taking into account any longitudinal information that becomes available.One such example concerns pregnant women.To detect a number of complications arising during pregnancy, a variety of quantities or characteristics are measured at antenatal examinations (Yamashita and others, 1989;Witt and others, 1990;Hahlin and others, 1991).One of these is the beta human chorionic gonadotropin (β-hCG), also known as the "pregnant hormone" or "announcer of pregnancy," which keeps the corpus luteum (yellow body) producing progesterone and estrogen during the early stages of pregnancy.The β-hCG hormone is produced by the placenta.It is detectable in the blood and urine within 10 days of fertilization.After the fertilized egg implants or attaches to the inside of the uterus or other structure inside the mother, the levels of β-hCG rise rapidly, frequently exceeding 100 mIU/ml.The levels continue to increase throughout the first trimester of pregnancy and reach a peak 60-80 days after the fertilized egg implants.The exact level of β-hCG in the blood can be measured using standard tests.These measurements can help giving a rough estimate of the age of the fetus.They can also help to determine if the pregnancy is progressing normally.Levels that are abnormally low or high may be signs that an abnormal medical condition is present.This would suggest the need for further evaluation and testing.
There is a great variation in β-hCG levels.However, it is not the absolute value that matters, but their relative changes.In a normal pregnancy, the level of this hormone approximately doubles every 1.5 days up to 5 weeks after the last menstrual period, and then every 3.5 days from the 7th week onward (Frits and Guo, 1987).After the first trimester, levels should gradually decrease over time and quickly decrease to zero after the pregnancy is ended.However, abnormally large levels of β-hCG may indicate choriocarcicoma of the uterus, down syndrome in the fetus, hydatidiform mole of the fetus, or ovarian cancer.Lower than normal β-hCG levels may indicate ectopic pregnancy, a miscarriage, or spontaneous abortion.In any case, a failure to exhibit normal growth patterns in β-hCG levels should usually be interpreted as a complication of pregnancy.
The main objective of this article is to explore a classification technique for predicting the outcome of pregnancy on the basis of the results of certain diagnostic tests administered to pregnant women.The inference problem is formally described as a discriminant analysis based on longitudinal β-hCG outcomes.Our approach is fully Bayesian and provides the posterior (or predictive) probability of outcome of pregnancy in women based on the longitudinal marker.Physicians can then make decisions on the basis of these probabilities.We consider a hierarchical structure that accommodates nonlinear profiles and allows classification for individuals with very few observations.This is specially relevant in the context of our motivating example, where more than one-third of the women had either one or two β-hCG measurements.
Extensions of classical discriminant analysis to multivariate response curves observed over fixed time intervals have been considered by Albert (1983).Some extensions to the case of unbalanced data have also been proposed, typically using linear or nonlinear random-effects models to describe the longitudinal profiles in each group.Verbeke and Lesaffre (1996) proposed the so-called heterogeneity model, i.e. a linear mixed-effects model with random effects sampled from a mixture of normal distributions.Verbeke and Molenberghs (2000) indicated that the classification rule implied by the heterogeneity model is equivalent to the discriminant function proposed by Tomasko and others (1999).Further developments along this direction have been discussed in Wernecke and others (2004).Marshall and Barón (2000) considered nonlinear random-effects models to describe evolutions in different groups and stated the optimal allocation rule.Brown and others (2001) discussed discriminant analysis using linear random-effects models from a Bayesian viewpoint.
In our approach, the classes or groups are predefined and the task is to understand the basis for the classification from a set of labeled subjects (training data set).This information is then used to classify future subjects.In the case that the classes or groups are unknown a priori and need to be estimated from the data, latent class models offer a fruitful approach.See additional developments along this direction in Muthén and Shedden (1999), Lin and others (2000), McCulloch and others (2002), Muthén and others (2002), and references therein.
The rest of this article is organized as follows: We first give a brief description of the data set in Section 2. In Section 3, we extend the framework of traditional classification methods to the longitudinal hierarchical setting.In Section 4, we illustrate the proposed longitudinal method using data from Santiago, Chile on the β-hCG measured in women with normal and abnormal pregnancy outcomes.An appropriate posterior simulation scheme based on the Gibbs sampling algorithm is described.Finally, Section 5 discusses the results.

PREGNANT WOMEN DATA
There is an increased risk of early pregnancy loss after assisted reproduction.Various studies have addressed the question about early β-hCG values and their relationship to pregnancy outcome after in vitro fertilization (e.g.Yamashita and others, 1989).We consider a data set with a total of 173 young women, representing different pregnancies over a period of 2 years in a private fertilization obstetrics clinic in Santiago, Chile.The β-hCG concentrations for the 173 women were measured during the first 80 days of gestational age.One of the main targets of the study was to evaluate these concentrations at early stages of pregnancy, with the purpose of identifying women with a high risk of loss.Consequently, pregnancy outcomes were divided into two groups: normal and abnormal.The women were classified as normal pregnancies if they had a normal delivery or as abnormal pregnancies if they had any complication resulting in a nonterminal delivery and loss of the fetus.The resulting data set consists of 124 patients diagnosed with normal pregnancy and 49 patients with abnormal pregnancy.The 173 women altogether contribute a total of 375 observations, where the number of samples per woman ranged from 1 to 6 (median 2).These data were originally presented in Marshall and Barón (2000).
We analyze the vectors of time-varying β-hCG measurements for the 173 women.Approximately 30% of the 173 women had one β-hCG measurement, 31% had two, 33% had three, and 6% had four or more measurements.Figure 1 presents the subject-specific log 10 β-hCG profiles for both groups.
The two populations appear clearly distinct when considering the ensemble of profiles.However, for any one of the profiles, the classification into one or the other subpopulation is far less certain, in particular when considering series of partial responses.Thus, our main inference goal in analyzing these data is to provide a classification rule for a new patient.The rule should allow sequential updating as data accrue for the new patient.

CLASSIFICATION USING NONLINEAR HIERARCHICAL MODELS
Suppose we are given a training data set comprising m units {(y i , z i ), i = 1, . . ., m}.Here y i = (y i1 , . . ., y in i ) ∈ n i represents the observed response vector for the ith unit, taken at arbitrary times t = (t 1 , t 2 , . . ., t n i ).Let z i ∈ {1, 2, . . ., g} denote the known group label for the ith unit.In our application, m = 173, g = 2, with z i = 1 and z i = 2 indicating normal and abnormal pregnancy, respectively.The label z i is known for some women with already reported delivery, but unknown for women with partial data before delivery.Without loss of generality, we assume that z i , i = 1, . . ., m, is known, and z m+1 is unknown.Let y m = (y 1 , . . ., y m , z 1 , . . ., z m ) denote all data, including the recorded class memberships z i , up to the mth patient.
Classification can be done either on a within-sample or on a predictive basis.For the former, the posterior probabilities {p(z i = k|y m ); k = 1, . . ., g} are appropriate, and these can be directly estimated as empirical averages in their Markov chain Monte Carlo run.The predictive classification problem assumes that a future unit m + 1 with as yet unknown label z m+1 is recorded, so interest focuses on inference about z m+1 , i.e. we are in principle interested in {p(z m+1 |y m , y m+1 )}, where y m+1 is the currently available partial response vector for the new patient m + 1.
We consider for group k a generic hierarchical model of the form In words, data y ik for the ith sampling unit in group k are sampled from a probability model parameterized by a vector θ ik .We will assume the parameter vector θ ik to be partitioned into common fixed effects θ F k and unit-specific random effects θ R ik .The θ R ik are assumed to be generated from a distribution parameterized by a d-dimensional hyperparameter vector φ k , thus implying a parametric model for random effects.Bayesian inference under parametric assumptions for random effects has been discussed by Bennett and others (1995), Racine-Poon (1985), Wakefield and others (1994), Wakefield (1996), Wakefield and Bennet (1996), and Ziegelmann and Brown (2001), among others.
For the top-level sampling model p(y ik |θ ik ) in (3.1), we assume a nonlinear regression with a mean function f (θ ik ; •) parameterized by θ ik and evaluated at known times t i jk , j = 1, . . ., n i , k = 1, . . ., g.The residual term i jk is assumed to be normally distributed with mean 0 and variance σ 2 k .To facilitate classification, we augment the model with a marginal probability for z i with k = 1, . . ., g. Specific choices for our motivating example will be discussed in Section 4. Note, however, that the augmented model implies the desired classification as a conditional probability p(z m+1 |y m , y m+1 ), marginalizing with respect to the unknown θ i , and other possibly unknown hyperparameters.
From a Bayesian viewpoint, the classification probabilities are obtained by weighting the posterior distributions of the parameters.Using Bayes' rule and some algebraic manipulations, the classification probability that a new unit y m+1 belongs to the kth group is However, the integration is usually analytically intractable.Therefore, we shall construct a set of stationary samples (3.4) In other words, the unit is classified in that group for which the highest posterior probability is attained, thus minimizing the expected misclassification rate.Of course, there are other loss functions that could be used, depending on whether it is more important to avoid false-positive or false-negative cases (see Hastie and others, 2001).

APPLICATION
We apply the proposed model to the analysis of the longitudinal β-hCG data discussed in Section 2.

Model specification
Mean values of the log 10 β-hCG for these 173 women show a nonlinear relationship with days of pregnancy.Figure 1 shows time profiles for normal and abnormal subjects.The analysis in Marshall and Barón (2000) suggests that woman-to-woman variation is adequately accounted for by the introduction of random effects to model the asymptotic behavior of the log 10 β-hCG level (θ ik below).Specifically, they proposed the following nonlinear random-effects model: where y i jk represents the jth log 10 β-hCG measurement on the ith woman in group k, where k = 1, 2 represent, respectively, normal and abnormal pregnancy groups, t i jk is the day at which the y i jk measurement was recorded, and the θ ik , i = 1, . . ., 173, are assumed N (θ k1 , τ 2 k ) and independent of the e i jk .The parameters θ θ θ 1 = (θ 11 , θ 12 , θ 13 ) and θ θ θ 2 = (θ 21 , θ 22 , θ 23 ) represent the population parameters of the logistic curve for the normal and abnormal pregnancy groups, respectively.
After integrating out random effects in the kth group, the likelihood in the kth group is still described by means of a normal distribution where the mean vector µ µ µ k (t ik ) has elements and represents the population curve at time t i jk .Furthermore, I is an n i ×n i identity matrix, and the vector v ik has elements which depend on the values of the unknown population parameters θ k2 and θ k3 .We complete the Bayesian formulation of Model (4.1) assuming prior independence for parameters, with distributions for k = 1, 2. Here, IG denote the inverse gamma distribution.In practice, the specification of hyperparameters θ θ θ 0 , D 0 , a 1 , c 1 , a 2 , and c 2 may be difficult, so we choose in Section 4.2 hyperpameter values defining vague priors.

Posterior computation
Posterior distributions of the parameters were estimated using the Gibbs sampling algorithm.The values of the hyperparameters in (4.2), (4.3), and (4.4) were taken as α α α 0 = (0, 0, 0) T , D 0 = 1000 I 3 , a 1 = a 2 = 3, and c 1 = c 2 = 0.01, which give the prior variance of σ 2 and of τ 2 to be 2500.The resulting prior densities are proper, but vague and hence relatively uninformative.Prior probabilities of group membership were assumed proportional to the size of the groups in the training sample.We also performed the analysis with different hyperparameter values, obtaining very similar results.This suggests robustness to the hyperparameter choices.
The full conditionals for implementing the Gibbs sampler for α k1 , b ik , σ 2 k , and τ 2 k are straightforwardly derived as normal, normal, inverse gamma, and inverse gamma distributions, respectively.The full conditionals for α k2 and α k3 are not available in closed form.A separate random walk Metropolis algorithm was used to simulate each α k ( = 2, 3).For this, we use a normal proposal distribution centered at the current value of α k with variance given by a constant c times the inverse information matrix.The tuning constant c controls the acceptance rate of the algorithm.If c is too small, then the acceptance rate is high, but jump sizes are correspondingly small, yielding slow convergence.Conversely, the selection of high values of c leads to larger jump sizes, but at the cost of lower acceptance rates.Following the recommendation of Gelman and others (1996), c shall be selected so as to yield empirical acceptance rates around 0.25.Consequently, we chose c = 2.
To perform the Gibbs sampling, we chose starting points in a neighborhood of the maximum likelihood estimates of model parameters.For this we used the library NLME of Pinheiro and Bates (2000) to fit the models.The algorithm was implemented using computer code written in C. We generated 250 000 iterations.After 10 000 iterations, samples were collected, at a spacing of 240 iterations, to obtain approximately independent samples, giving samples of size 1000 for calculating posterior quantities of interest.
We checked convergence for each monitored variable, using the CODA software of Best and others (1995) to produce convergence diagnostics.Application of Geweke's (1995) convergence criterion separately to each of the model parameters showed no evidence against convergence, as the absolute value of the Z -statistic was less than 1.7 in all cases.Also, all sequences passed the stationarity test of Heidelberger and Welch (1983).

Results
The mean gestational ages (days) in women with normal and abnormal pregnancies were 34.5 and 32.9 days, respectively; this difference was not significant.We found that concentrations of β-hCG were significantly lower in women with an abnormal pregnancy than in those with a normal pregnancy.Table 1 presents posterior summaries of the parameters.Observed differences in the group-specific estimates of the variance components models suggest that there is more between-subject variability in the abnormal group than in the normal group.
As part of the analysis, we estimated individual log 10 β-hCG profiles and standard errors.Fitted profiles with ±2 SD curves are displayed for six selected patients in Figure 2. Three of them were in the normal group (patients 3, 24, 66) and the remaining three in the abnormal group (patients 27, 34, 45).We note that the inference captures the varying observation error between subjects.An important feature of the Bayesian approach is that the credible intervals take into account the variability of all parameters, including unknown hyperparameters.We note here that all these patients, except patient 34, were correctly classified according to our proposed approach.
Once the posterior probabilities p(z i = k|y m ) are estimated, several quantities of interest can be readily evaluated.Among these, "sensitivity" and "specificity" are popular ways to summarize the classification results.Letting A = {the patient is classified as abnormal} and E = {the patient actually belongs to the abnormal group}, then sensitivity and specificity are defined, respectively, as Pr I {z i = 1}, respectively, where ẑi was defined in (3.4).From our results we found 57% sensitivity and 95% specificity.
We also computed the misclassification error rate, which was found to be 16% for this data set.However, it is well known that the error rate obtained by applying the classifier to the same data from which it has been formed tends to be biased downward as an estimate of the true error rate (Hastie and others, 2001).Several methods are available to solve this problem.For moderately large data sets, we could consider a series of randomly chosen divisions of the data into two components, one reserved for deriving the classification rule (the training sample) and the other to assess this rule (the test sample).Under this method, the estimated error rate is the average error rate over all the generated divisions.For smaller data sets like the one at hand, a cross-validation (CV) technique can be used to compensate for the lack of data.Specifically, we consider the leave-one-out cross-validation method, which consists of splitting the m samples into a training sample of size m − 1 and a test sample of size 1.The estimated error rate is then the average number of times the test sample was misclassified over the m possible divisions.A critical advantage of this method is that all the data can be used for training, without the need to set aside subjects purely for testing.The CV error rate is an important statistical estimator of the performance of a classification rule when the sample size is small.It is frequently used by many researchers.A direct generalization of the above procedure is the k-fold CV, where the data are divided randomly into k components of approximately equal sizes.Next, k − 1 components are used to compute the classification rule, which is then tested on the omitted component.This process is repeated k times, until all components are used to assess the classification rule, and the error rates averaged.We performed both leave-one-out CV and 5-fold CV obtaining misclassification error rates of 17 and 17.3%, respectively.Both error rates are very similar, which suggests that these methods provide good estimates that are reasonably consistent.See further properties and discussion of these methods in Hastie and others (2001).
In the above analysis, we have used all the available information.However, it is interesting to assess the predictive power of our model.A possible way of doing this is to study how the classification probabilities change with the number of available observations, for a given patient.Thus, we generate from the corresponding posterior predictive distributions one future patient for each group, and evaluate (3.2) for up to five possible observations.Time points were chosen from the empirical distribution of observed times within each group.Figure 3 shows the evolution of these probabilities for each future patient.For the normal patient, we observe a steady growth of the probabilities.For the abnormal patient, however, this probability first increases and starts to decrease to values that leave no question about the classification.A possible explanation for this is the rather heterogeneous patterns found for abnormal patients.Indeed, many of these show an initial increase in the log 10 β-hCG responses (just as all the normal patients do) followed by a decrease in some of the patients.Thus, the classification probabilities for abnormal patients require more observations than the abnormal ones to reflect the correct outcome.

DISCUSSION
This paper proposes a general Bayesian framework for the classification of longitudinal profiles where the underlying models in each group or population are given by nonlinear hierarchical models.The approach has the advantage of being appropriate for classifying longitudinal profiles of data sets with an unbalanced data structure.It also has the ability to use all the information for classifying subjects over time, regardless of the number or timing of the observations.Moreover, the influence on discrimination of both the between-group and within-group components of variability can be readily quantified, and the posterior simulation scheme is straightforwardly implemented.
This approach is particularly appropriate for decision making in clinical practice where the number and times of observations are often arbitrary and depend on the progression of the patient.In this context, the approach we have presented solves one important problem.
In our study, the group "abnormal pregnancies" implies women with any complication resulting in a nonterminal delivery and loss of the fetus.This involves a wide range of pathologies including spontaneous abortion (hard to prevent in advance) and ectopic pregnancy (an acute condition that requires surgery).Measurements of serum β-hCG at regular examination times may help identifying women with a high probability of abnormal pregnancy and who may benefit from closer surveillance.For instance, in the case of ectopic pregnancy, this may help to reduce the risk of tubal rupture.Our results may help both physicians and patients make informed treatment decisions on the basis of objective staging information.Indeed, a reliable and inexpensive diagnostic test to differentiate between normal pregnancies and pregnancies with early adverse outcome might reduce the psychological tension and anxiety present in many patients.It may also help to reduce treatment cost by making it more efficient.On the other hand, for patients judged by this test to be at high risk of an unfavorable outcome, a more careful follow-up might help to reduce the associated risks.
Other markers, such as serum progesterone measurements, have been used to distinguish normal from abnormal (spontaneous miscarriage) pregnancies.A straightforward generalization of our approach would accommodate additional information when this is available, either by inclusion of more covariates or by considering other markers, thus extending the framework to a multivariate one.

Fig. 2 .
Fig. 2. Fitted curves for three patients in the normal group (patients 3, 24, 66) and three in the abnormal group (patients 27, 34, 45).The points are the actual observations.The solid lines represent the fitted curves and the dashed lines represent fitted curve ± two posterior standard deviation.

Fig. 3 .
Fig. 3. Evolution of classification probabilities for one normal (solid line) and one abnormal (dashed line) future patient as a function of the number of observations.

Table 1 .
Summary of model fitting