Statistical methods for panel data from a semi-Markov process, with application to HPV

Continuous-time, multistate processes can be used to represent a variety of biological processes in the public health sciences; yet the analysis of such processes is complex when they are observed only at a limited number of time points. Inference methods for such panel data have been developed for time homogeneous Markov models, but there has been little research done for other classes of processes. We develop likelihood-based methods for panel data from a semi-Markov process, where transition intensities depend on the duration of time in the current state. The proposed methods account for possible misclassiﬁcation of states. To illustrate the methods, we investigate a three- and a four-state models in detail and apply the results to model the natural history of oncogenic genital human papillomavirus infections in women.


INTRODUCTION
Continuous-time, multistate stochastic processes provide a useful framework for many studies of event history data (Commenges, 1999(Commenges, , 2002Hougaard, 1999;Andersen and Keiding, 2002). Most research in continuous-time, discrete-state processes has been probabilistic, and inference about processes using independent realizations from a group of individuals has been based almost entirely on settings where sample paths are continuously observed (cf. Andersen and Borgan, 1985).
However, in many instances, observations consist of the states of the individual processes at discrete time points, with no information about the types and times of events between observation times. For example, when state transitions of a process are silent events-such as the onset of an early stage of a disease before symptoms-the sample paths are observed infrequently, often resulting from diagnostic tests given during patient visits to their caregivers. Inference methods for such panel data from multistate processes have been limited. Kalbfleisch and Lawless (1985) proposed methods for the analysis of panel data for Markov models with time homogeneous transition intensities; Kay (1986), Andersen (1988), and Gentleman and others (1994) have applied these methods to cancer, diabetes, and HIV. Inference based Statistical methods for panel data from a semi-Markov process 253 on Markov models in such settings is greatly simplified, because the discrete-time process observed at prespecified time points forms a Markov process.
In many applications, however, the Markov assumption is not appropriate because the transition intensities depend on the elapsed time in the current state. For instance, in modeling the natural history of human papillomavirus (HPV), which is known to cause almost all cervical cancers, the Markov assumption would not account for the strong association between infection duration and progression to cervical abnormality (Stoler, 2000). Sexually acquired genital HPV infections in women are often transient and recurring, and are usually resolved by the host prior to the onset of symptoms. However, epidemiology studies show that when an infection by certain HPV types persists for a long period of time, it can eventually lead to clinical conditions such as high-grade cervical intraepithelial neoplasia (CIN 2 and 3) and, ultimately, cervical cancer (Stoler, 2000). When the underlying process is not a time homogeneous Markov process, the observed discrete-time process will in general have a complex structure. A further complication that can arise is that the observations of the states of the process are subject to misclassification.
Motivated by the HPV studies, this paper considers inference for continuous-time semi-Markov processes in settings where sample paths of individuals are observed only at a finite number of prespecified times, possibly with misclassification errors in the observed states. We show that evaluation of likelihood functions can be greatly simplified when the transition intensity from at least one of the states of the underlying process is time homogeneous. Section 2.1 introduces the underlying and the observed processes and outlines the general approach to likelihood methods. Section 3 presents the likelihood contributions in detail for a nonprogressive three-state process. Section 4 applies the proposed methods to a recent HPV trial and a simulated study.
The process is semi-Markov if the sequence {σ 0 , σ 1 , . . .} of consecutively occupied states forms a simple Markov chain, and the sojourn times τ n between consecutively occupied states are independent random variables with distributions that depend only on the adjoining states (cf. Cox and Miller, 1977). The probabilistic properties of a semi-Markov process can be characterized by the transition intensities, or cause-specific hazard functions, among states. Suppose that the process enters state i at its nth transition. Then the transition intensity functions out of state i, say λ i j (·), are where t denotes the elapsed time from entrance into state i, for i, j = 1, . . . , K , i = j, and n = 1, 2, . . . . Such dependence on the duration (t) in the current state (σ n ) makes semi-Markov processes distinct from Markov processes. The probabilistic properties of semi-Markov processes have been studied extensively (Cox and Miller, 1977), and inferences for settings where sample paths are continuously observed or right censored can be made by extensions of methods for ordinary failure time data (cf. Lagakos and others, 1978). Inferences for unidirectional, or progressive, processes that lead to interval-censored data have also been developed 254 M. KANG AND S. W. LAGAKOS . However, we consider settings where the independent realizations of the process X (·) corresponding to different subjects are observed only at a finite number of prespecified time points and the process may be bidirectional (nonprogressive). Consider the values of X (·) at the fixed times 0 = v 0 < v 1 < v 2 < · · · < v M , and let X = (X 0 , X 1 , . . . , X M ), where X m = X (v m ). Despite the simple probabilistic form of the underlying semi-Markov process X (·), the joint distribution of X 0 , X 1 , . . . , X M is in general complex because the time points v 0 , v 1 , . . . , v M do not correspond to transition times between states of the process. If the process is not progressive, the states visited in the sample path are not determined by the knowledge of the states occupied at visit times.
Inferences about the parameters λ i j (·) are further complicated by misclassification errors. Rather than X, suppose that the observation for a subject is given by where Y m ∈ {1, 2, . . . , K } denotes X m subject to misclassification error. We assume that the misclassification error probabilities satisfy the conditional independence assumption (2.1) That is, conditional upon the true values of X (·) at the visit times, the distribution of Y m depends only on the value of X (·) at v m . We denote these error probabilities by α ik = P(Y m = k|X m = i).

Likelihood function
The likelihood contribution for an individual can be written as where x m , y m ∈ {1, . . . , K }, c xy = M m=0 α x m ,y m , and the summation is over all possible state sequences. If the transition intensities from at least one of the K states can be assumed to be time homogeneousthat is, λ i j (t) = λ i j for j = 1, . . . , K if and only if i ∈ C, where C is a nonempty subset of {1, . . . , K }then P(X = x) in the evaluation of (2.2) is nicely simplified. A consequence of this assumption is that for any times 0 t 0 < t 1 < · · · (whether or not these correspond to visit times) and any x m ∈ C for m = 0, 1, . . . , where φ(t m ) denotes the time at which the process last entered state x m prior to t m . The semi-Markov nature of X (·), which ensures that the future of the process for times greater than φ(t m ) depends on the history of the process only through φ(t m ) and x m = X (φ(t m )), combined with the memoryless property of sojourn times in state x m yields this stationarity property of the process at times when the process is in a state belonging to C. To illustrate how (2.3) can greatly simplify the evaluation of the finite-dimensional distributions of X (·), suppose that C = {1}. Then, for example, Thus, instead of having to compute the joint probability of the 8-dimensional vector [X (t 0 ), . . . , X (t 7 )], the result in (2.3) reduces this to computing the distribution of one 3-dimensional vector and two 2-dimensional vectors. More generally, if we define, then for any positive integer L, By taking L = M and t m = v m , the result in (2.4) simplifies the calculation of the likelihood contribution in (2.2) by allowing P(X = x) to be expressed as a product of one-step probabilities of the form and of multistep probabilities of the form where 0 < 1 < 2 < · · · correspond to differences between the original visit times v 0 , . . . , v M . Calculations of the conditional probabilities in (2.5) and (2.6) can be difficult because of the unknown transitions in the underlying sample paths. However, they also can be simplified by applying the result in (2.4) to the probabilities corresponding to the potential underlying sample paths that yield the observed states in (2.5) and (2.6). represent the infection-free, currently infected, and clinical disease statuses, respectively. Then a subject who is initially uninfected can become infected for a period of time, after which the infection resolves to the infection-free status (1-2-1), or the infection leads to clinical disease (1-2-3). Alternatively, the subject could develop clinical disease from the infection-free state (1-3). For someone beginning in state 1, every sample path will consist of k visits (k 0) to state 2, followed by a visit to state 1 or 3 (if k 1).

A THREE
Suppose that individuals begin in state 1 and C = {1}, so that λ 1 j (·) = λ 1 j for j = 2, 3. The remaining transition intensities, λ 2 j (·) for j = 1, 3, are arbitrary. Let f i j (·) denote the subdensity function corresponding to λ i j (·); that is, 3)-(2.6), calculation of the likelihood contribution of a subject requires consideration of at most the conditional probabilities, (3.1) for j = 1, 2, 3, where i > 0 and 0 < 1 < 2 < · · · denote the distinct values of v j − v i and v i < v j are the visit times. When visit times are equally spaced, say every time units, these conditional probabilities simplify to for j = 1, 2, 3 and m = 1, 2, . . . . Note that for each observed path, there may be infinitely many underlying sample paths. For example, for the observed sequence 1 − 2 − 1 , the underlying sample path of the process can be of the form and so on, where the unboxed states denote the unobservable state changes of the process between visit times. We now consider the probability elements for the one-and multistep conditional probabilities in (3.1) in detail. Let P 1 j (k, t) denote the conditional probability that the process is in state j at time t 0 + t after k visits to state 2 in the interval (t 0 , t 0 + t), given that the process is in state 1 at time t 0 , for j = 1, 2, 3 and k = 0, 1, . . . . Due to stationarity in (2.3), we may set t 0 = 0. Note that P 12 (k, t) = 0 for k = 0, since there must be at least one visit to state 2. The case of j = 1 for k = 0, 1, 2 is depicted in Figure 2, where u represents the unknown time of transition from state 1 to 2 and z denotes that from state 2 to 1. In this notation, ( 3.2) The sum will be finite if sojourn times from state 2 have a guarantee time; that is, support bounded away from zero. For example, if λ 2 j (t) = 0 for 0 t G, then the upper limit in the sum is the greatest integer less than or equal to t/G.
For the calculation of P 11 (k, t), we have P 11 (0, t) = exp{− 1 (t)} and For k > 1, the probability of k transitions to state 2 is given by a convolution of probability functions differs from P 11 (k, x) in that its corresponding underlying process ends at the transition time (marked by state 1 * in Figure 2). Thus, Fig. 2. Depiction of sample paths for P 11 (k, t), k = 1, 2, 3. The boxed states represent the states occupied by the process at the visit time, the dots between the states indicate that the process has remained in the same state since the previous transition, and the arrows indicate the unobservable transitions. The asterisk next to 1 indicates the path segment that ends at transition to state 1.

258
M. KANG AND S. W. LAGAKOS Fig. 3. Depiction of a sample path for P 121 (1, 0, t 1 , t 2 ). The sample path observed to be in states 1, 2, and 1 at visit times 0, t 1 , and t 2 , respectively, with one transition to state 2 in the visit interval (0, t 1 ] is shown. We can extend the definition of P * 11 (1, x) to P * 11 (k 1 , x), and generalize the convolution function to where k 1 + k 2 = k, for k 1 , k 2 1. Next, let P 13 (k, t) denote the probabilities contributing to the calculation of P[X (t) = 3 | X (0) = 1]. Since state 3 can be reached from state 1 or state 2, P 13 (k, t) can be expressed as the sum of P (1) 13 (k, t), which denotes that state 1 immediately precedes state 3, and P (2) 13 (k, t), which denotes the other possibility that transition to state 3 was made from state 2. Then (see Appendix as supplementary material available at Biostatistics online) Having computed P[X (t 0 + t) = 1|X (t 0 ) = 1] and P[X (t 0 + t) = 3|X (t 0 ) = 1], P[X (t 0 + t) = 2|X (t 0 ) = 1] is obtained as the complement of their sum.
The ideas in one-step conditional probability calculations can easily be extended to obtain the multistep conditional probabilities in (2.6). For example, consider P[X (t 2 ) = 1, X (t 1 ) = 2 | X (0) = 1]. Define P 121 (k 1 , k 2 , t 1 , t 2 ) to be the conditional probability that X (t 1 ) = 2 and X (t 2 ) = 1, with k 1 visits to state 2 in (0, t 1 ] and k 2 visits to state 2 in (t 1 , t 2 ], given that X (0) = 1. Note that P 121 (k 1 , k 2 , t 1 , t 2 ) differs from P 11 (k 1 + k 2 , t) in that state 2 is known to be occupied in the process at time t 1 in P 121 (k 1 , k 2 , t 1 , t 2 ). To illustrate, a sample path corresponding to P 121 (1, 0, t 1 , t 2 ) is depicted in Figure 3. Details are presented in the Appendix as supplementary material available at Biostatistics online.

Estimation and model assessment
The expressions developed above can be combined to obtain an expression for P(X = x) and then the likelihood contribution for an individual from (2.2). The overall likelihood is obtained as the product of the contributions of individual subjects and will be a function of {λ 12 , λ 13 , λ 21 (·), λ 23 (·)}. If parametric forms are assumed for λ 21 (·) and λ 23 (·), the likelihood function will depend on a finite-dimensional parameter vector and standard numerical methods can be used to obtain the maximum likelihood estimator. Under the standard regularity conditions, the maximum likelihood estimators will be consistent and asymptotically normal.
To examine the adequacy of model fit, the visit patterns can be partitioned into several categories, and the observed frequencies of the categories can be compared with the estimated model-based frequencies.
An overall χ 2 statistic would then have an approximate chi-square distribution with the degrees of freedom equal to the number of independent cells minus the number of model parameters.

HPV application
Data, model assumptions, and methods. We illustrate the proposed methods with the placebo data from a completed pilot clinical trial of a candidate vaccine for HPV type 16 (Koutsky and others, 2002). There were six scheduled visits at baseline (day 0), months 7, 12, 18, 24, and 30, at which times the presence or absence of HPV type 16 (denoted HPV16) and CIN 2/3 (denoted CIN) was assessed. We considered the 699 placebo subjects, initially HPV 16 negative and free of CIN 2/3, who provided the data on the first five (388) or full six visits (311). To facilitate the computations, we treated month 7 visit as if it occurred at month 6, so that all visits would be equally spaced. The setting can be described by the four-state process depicted in Figure 4, where state 3 (state 4) corresponds to diagnosis of CIN following a scheduled visit where HPV16 was not (was) detected. This model can be viewed as an extension of the three-state model in Figure 1 in which state 3 is divided into two states based on the presence of HPV16 when CIN was diagnosed. In practice, these two CIN states would be interpreted as having been caused by HPV16 (state 4) or another type of HPV (state 3). Overall, 10 subjects were diagnosed with CIN during the study period, five of whom were HPV16 positive at the time of diagnosis. The majority of subjects were observed to be in state 1 at all six visit times (260/311) or at all five times (344/388).
We considered C = {1} and modeled the transition intensities for state 2 as step functions, incorporating the biological minimum time required for clearance of infection or progression to disease; that is, for j = 1, 4. Hence, λ 2 j (t) is dependent on time due to the corresponding guarantee time, G 2 j . We considered guarantee times of G 21 = 5 and G 24 = 6 (in months) for the HPV study. We assumed that the diagnostic tests for HPV and CIN have perfect sensitivity but may have imperfect specificity, and reestimated model parameters assuming several assumed specificities.

Results.
Estimates of λ λ λ assuming various misclassification errors are presented in Table 1. The maximized likelihood decreases sharply as either specificity declines below one and indicates that the assumption of no misclassification errors provides the best model fit. The very small estimates of λ 13 in the presence of imperfect CIN diagnostic test would suggest that the observed prevalence of women in state 3 is almost fully explainable by misclassification, implying that HPV infections of types other than 16 do not lead to CIN 2/3 during the 30-month period of follow-up. However, this would be inconsistent with the literature which suggests that various HPV types are responsible for CIN outcomes. (cf. Liaw and others, 1999). Therefore, the model assuming no misclassification errors seems more reasonable for the data.  Table 1 We found the inverse sample information to be numerically unstable and relied on the bootstrap method using 100 bootstrapped samples. The model assuming known G 21 = 5, G 24 = 6, and no misclassification errors gives the following estimates of λ λ λ and standard errors, based on four unknown parameters in the model: λ 12 = 0.0051(0.00050), λ 13 = 0.0003(0.00013), λ 21 = 0.0882(0.014), and λ 24 = 0.0031(0.0037). Following the guarantee times, the estimated risk of clearance is about 28 times higher than the risk of progression to CIN after 6 months. The conditional probabilities of HPV16 infection (entering state 2) and non-HPV16-related CIN development (entering state 3) are 0.944 and 0.056, given that a woman leaves the uninfected state (state 1). Once a woman is infected with HPV16, the conditional probabilities of clearing the HPV16 infection, P(σ n+1 = 1|σ n = 2), and progressing to CIN, P(σ n+1 = 4|σ n = 2), are 0.969 and 0.031, respectively. The mean times to clearance and progression to HPV, conditional on the next state, are about 16 and 17 months, respectively, consistent with commonly used definitions of "persistent infection" (Ho and others, 1998;Koutsky and others, 2002;Moscicki and others, 1998). The estimated cumulative probability that a woman develops CIN based on the model estimates ( Figure 5) shows a steady rise over time, with about one-third of the outcomes attributed to HPV16, also consistent with the epidemiological literature (Liaw and others, 1999;Herrero and others, 2000). HPV16 prevalence ( Figure 5) stabilizes to 7-8% after about 3 years.

. Estimates of λ λ λ under various specificities. The specificity for HPV detection is denoted by P(PCR−| HPV−), where HPV detection is assessed by a polymerase chain reaction (PCR) assay, and the specificity for CIN detection is denoted by P(Dx−|CIN−) to indicate correct negative diagnosis (Dx−)
Model fit (Table 2) was assessed by comparing the observed frequencies with the expected frequencies of the selected visit patterns. The chosen model (Model 1) appears to fit the data adequately ( p = 0.65, 8d.f.), while the goodness-of-fit results for the same model but allowing a specificity of 0.99 for the HPV16 assay (Model 2) indicate poorer fit ( p = 0.04).
Fitting the standard Markov model (Kalbfleisch and Lawless, 1985) yielded constant transition intensities (denoted by γ i j ) of γ 12 = 0.0062 (0.00067), γ 13 = 0.0003 (0.00013), γ 21 = 0.0551 (0.0089), and γ 24 = 0.0050 (0.0022). The estimated transition intensities from state 1 were very similar to those from the semi-Markov methods. The resulting cumulative incidence and prevalence curves were also similar to Fig. 5. Model-based estimation of cumulative incidence of CIN 2/3 and prevalence of HPV16. The dotted, solid, and dashed lines represent the prevalence of HPV16, the cumulative probability of overall CIN 2/3, and the cumulative probability of CIN 2/3 caused by HPV16, respectively. The step functions portray the empirical estimates from the 30-month data, and the smooth curves represent the model-based estimates over a period of 10 years. The vertical lines are the pointwise 95% confidence intervals for the predicted probabilities.

A simulated data example
To further illustrate the value of the methods, data were simulated from a four-state semi-Markov process of Figure 4 with C = {1} and λ 2 j (t) = a 2 j b 2 j (t − G 2 j ) b 2 j −1 , j = 1, 4 (Weibull hazard function). For the simulation, λ 12 = 0.025, λ 13 = 0.002, a 21 = 0.2, a 24 = 0.001, b 21 = 0.5, b 24 = 2, G 21 = 4, G 24 = 5, visits are every 6 months for 30 months and N = 5000, a sample size typical of a moderately sized Phase III vaccine trial. The simulated data produced 231 and 233 observations in states 3 and 4, respectively, by the end of the observation period, with 2252 subjects observed to be in state 1 at each visit, and yielded λ 12 = 0.0263, λ 13 = 0.00206, a 21 = 0.209, a 24 = 0.00158, b 21 = 0.510, and b 24 = 1.873, assuming that G 21 and G 24 are known. Figure 6(a) shows that the estimated cumulative CIN incidence and HPV prevalence are very close to the true probabilities. However, when a Markov model is incorrectly assumed, the biases in such estimates increase with time ( Figure 6(b)).

DISCUSSION
Panel data pose challenges for semi-Markov processes because missing information about an individual's process between the observations can contribute to the probability of being in the current state. Hence, despite the wider applicability of semi-Markov processes compared to the Markov processes, research in this area has been slim. We showed that when the transition intensities from at least one of the states of the process are time homogeneous, the expression for the joint probability in the likelihood function is tractable. In our data applications, we considered a process that begins in a state in C, so that it is not necessary to take account of the duration of time that individuals have been in this state prior to the start of the study. If X (0) / ∈ C, the proposed methods would still apply if either the subjects had just entered this state or the durations of time in this state prior to entrance into the study were known. Otherwise, the methods would need to be modified to account for the duration of time in the initial state prior to the start of the study. Methods similar to those in the work by Satten and Sternberg (1999) might be useful for this purpose.
In the HPV data example, the data came from a proof-of-principle study, thus smaller than a Phase III vaccine trial, and the number of CIN 2/3 events was small (10 events). Several Phase III HPV vaccine trials are currently underway, and are approximately five times the size of the pilot trial, and will be more suitable for the methods developed here, as the simulated data example illustrates. Without data on sexual activities and HPV types other than type 16, we assumed C = {1}, and exponential distributions with guarantee times were chosen for state 2 for model simplicity. Whereas the guarantee times limited the number of potential paths in our example, one can also limit the number of transitions that can occur in a given time interval when enumerating the potential paths. The clinical estimates for guarantee times were not clear, but we found that varying these times (from 4 to 8 months) did not change the likelihood parameter estimates substantially in our data.
The methods in the paper can be extended to allow fixed covariates in the model, for instance, to compare two treatments in a clinical trial. A one-sample model can be fit separately for each distinct covariate vector, and the overall likelihood is obtained by multiplying the likelihoods from the homogeneous groups.