Assessing the accuracy of predictive models with interval-censored data

Summary. We develop methods for assessing the predictive accuracy of a given event time model when the validation sample is comprised of case $K$ interval-censored data. An imputation-based, an inverse probability weighted (IPW), and an augmented inverse probability weighted (AIPW) estimator are developed and evaluated for the mean prediction error and the area under the receiver operating characteristic curve when the goal is to predict event status at a landmark time. The weights used for the IPW and AIPW estimators are obtained by fitting a multistate model which jointly considers the event process, the recurrent assessment process, and loss to follow-up. We empirically investigate the performance of the proposed methods and illustrate their application in the context of a motivating rheumatology study in which human leukocyte antigen markers are used to predict disease progression status in patients with psoriatic arthritis.


INTRODUCTION
Progress towards personalized medicine will depend critically on the development and evaluation of predictive models (Ginsburg and McCarthy, 2001). Accurate predictive models provide a basis for clinical decision-making informed by the outcomes patients are anticipated to experience (Gerds and others, 2008). Common approaches for quantifying the overall performance of a prediction model are based on explained variation (Korn and Simon, 1990, 1991), the Brier score (Brier, 1950), and loss functions (Wald, 1950). Loss functions measure the distance between the predicted and observed values and, when averaged over all possible data realizations, yield a measure of the prediction error. Validation in an independent external data set is the best way to assess the performance of a predictive model. In the absence of an independent validation sample, the prediction error is typically estimated using a model-based estimator, an estimator of the apparent loss, or cross-validation (Korn and Simon, 1990; Efron, 2004; Rosthøj and Keiding, 2004).
There has been much research on the development of predictive models for time-to-event data when the outcomes are subject to right censoring. Here, rather than predicting the actual event time, it is more common to predict event status at a landmark time of interest (Uno and others, 2007). Assessing the accuracy of a predictive model is challenging even in this setting when the validation sample is subject to right censoring, since individuals who are censored before this landmark time will have an unknown event status. To address this, Korn and Simon (1990) proposed the use of a bounded loss function for predicting survival time, while inverse probability of censoring weights (IPW) have been used by many authors to deal with censored outcomes (Graf and others, 1999; Hothorn and others, 2006; Gerds and Schumacher, 2006; Lawless and Yuan, 2010).
When the goal is to predict the event status of each individual in a validation sample at a landmark time $t_0$, the discriminative ability of a predictive model can be represented by a receiver operating characteristic (ROC) curve constructed from the sensitivity and specificity of the classification procedure. Akritas (1994) proposed an estimator, based on a nearest neighbor kernel method, of the bivariate distribution function of the predictor and the response subject to right censoring. An alternative simple estimator is based on the Kaplan-Meier estimate (Heagerty and others, 2000), and Yuan (2008) discussed an estimator for right-censored data using inverse probability of censoring weights.
We consider the problem of assessing the predictive accuracy of a failure time model for the event status (event-free or not) at a landmark time. We suppose data are obtained from a clinical registry of individuals with a chronic disease who are under intermittent observation. The event is only observable upon examination (e.g., by radiological examination) and so the event status is only determined at clinic visits; this observation scheme leads to case $K$ interval-censored data (Sun, 2006). In a validation sample, the concordance of the predicted and actual event status at the landmark time can only be determined for the subset of individuals whose censoring intervals do not span the landmark time. Imputation-based techniques can be employed, but the resulting estimates can be seriously biased when the model is misspecified. Alternatively, one may restrict attention to individuals who can be definitively classified, but this restriction yields a biased validation sub-sample. We propose a joint model for the event process, assessment process, and loss to follow-up time which enables computation of the weights needed to address the biased validation sub-sample. Augmentation terms are also developed to increase the efficiency of the weighted estimators of predictive accuracy and to explore robustness.
The remainder of this article is organized as follows. In Section 2, we describe the nature of the observation process leading to case $K$ interval-censored data and define notation. We describe imputation-based and IPW estimators of the prediction error in Section 3.1, and AIPW estimators of the prediction error, which use both imputation and weighting, in Section 3.2. Weights are estimated by introducing a joint multistate model for the event process, assessment process, and loss to follow-up; the construction and the assumptions justifying the factorization of the full likelihood are given in Appendix A of the Supplementary material available at Biostatistics online. The corresponding estimators for sensitivity, specificity, and the area under the ROC curve are then described. Methods for assessing the discriminative power of predictive models based on ROC curves and the area under such curves are described in Section 4. The design of the empirical studies is given in Section 5, where we report on the finite sample properties of the various estimators and explore robustness to misspecification of the model for the visit process. An illustrative example involving data from a psoriatic arthritis clinic is given in Section 6, and we conclude in Section 7 with some general remarks.

VALIDATION SAMPLE DATA AND A JOINT MODEL
We let $T$ denote the event time; an event time model can then be represented by a progressive 2-state stochastic process with states labeled 0 and 1 representing the conditions of being event-free and post-event, respectively. We consider the setting of a chronic disease process in which mortality rates are sufficiently low that they can be ignored in terms of modeling and risk prediction. Let $Z(s) = I(T \le s)$ record the state occupied at time $s$ and let $\{Z(s), 0 \le s\}$ be the corresponding stochastic process. We assume a prediction is obtained from a parametric hazard-based model for $T$ parameterized by $\theta$, where $X$ denotes a $p \times 1$ covariate vector. We consider the development of a prediction model based on a training sample $\mathcal{T}$ comprised of $n_0$ independent individuals yielding data $D_{\mathcal{T}}$, and let $\hat{\theta} = \hat{\theta}(D_{\mathcal{T}})$ denote the maximum likelihood estimator. In Appendix A of the Supplementary material available at Biostatistics online, we describe the construction of the full likelihood and outline the assumptions necessary to justify the partial likelihood typically used when fitting regression models for case $K$ interval-censored data; see also Cook and Lawless (2018, 2021). We then consider the evaluation of the prediction model using a validation sample $\mathcal{V}$ of $n$ independent individuals subject to case $K$ interval censoring (Sun, 2006), wherein the times of all visits and loss to follow-up are available. Specifically, suppose the intention is to follow an individual up to an administrative censoring time $B$, and let $R$ ($R < B$) denote a random loss to follow-up time. Then, with $C = \min(R, B)$, we define $C(s) = I(C \le s)$ and let $\{C(s), 0 < s\}$ denote the counting process for the censoring time. Individuals are observed at a baseline assessment at $a_0 = 0$ and then intermittently at follow-up clinic visits at times $0 < a_1 < \cdots$, where $A_r$ denotes the random time of the $r$th visit and $a_r$ is its realized value, $r = 1, \ldots$.
We let $A(s) = \sum_{r=1}^{\infty} I(A_r \le s)$ be a right-continuous process counting the number of post-baseline assessments over $(0, s]$, and let $dA(s) = A(s) - A(s^-) = 1$ if an assessment is made at time $s$ and $dA(s) = 0$ otherwise. Since visits can only be made for individuals still on study, the visit process terminates at $C$, so we observe $d\bar{A}(s) = Y(s)\,dA(s)$, where $Y(s) = I(s \le C)$ and $\bar{A}(t) = \int_0^t d\bar{A}(s)$. When considered jointly, we refer to $\{C(s), A(s), 0 < s\}$ as the observation process, and we observe $Z(s)$ only if $d\bar{A}(s) = 1$.
Assessing the accuracy of a prediction model is challenging when data in the validation sample are incomplete due to intermittent observation and loss to follow-up, and either imputation or weighting may be used to address this. For right-censored data, contributions from individuals that can be used for validation are often weighted based on the censoring distribution (Robins and others, 1994; Gerds and Schumacher, 2006). When individuals are examined intermittently and the event time is interval-censored, contributions from individuals who provide definitive validation data must be weighted in a way that addresses the more complex observation process. We consider a multistate framework for joint consideration of the event, assessment, and loss to follow-up (censoring) processes, and discuss inverse probability weighted (IPW) and augmented inverse probability weighted (AIPW) estimation of the prediction error. If $\{(C(s), \bar{A}(s), Z(s)), 0 < s\}$ denotes the joint process, then the states in the joint state space can be represented by the triple $(k_C, k_A, k_Z)$, where $k_C, k_Z \in \{0, 1\}$ indicate the loss to follow-up and event status, and $k_A \in \{0, 1, 2, \ldots\}$ represents the cumulative number of assessments. Figure 1(a) displays a state space diagram in which we distinguish assessments made while an individual is in the event-free state (the second row of states) from those made following event occurrence (the third row of states). Horizontal $(0, k_A, 0) \to (0, k_A+1, 0)$ transitions correspond to the occurrence of a pre-event assessment, while $(0, k_A, 1) \to (0, k_A+1, 1)$ transitions correspond to assessments made following event occurrence. To distinguish pre- and post-event assessments notationally, we let $d\bar{A}^-(s) = I(s < T)\,d\bar{A}(s)$ and $d\bar{A}^+(s) = I(T \le s)\,d\bar{A}(s)$, so that $d\bar{A}(s) = d\bar{A}^-(s) + d\bar{A}^+(s)$. The vertical transitions to the top and bottom rows correspond to loss to follow-up pre- and post-event, respectively.
For case $K$ interval-censored data, the parameters of the censoring and assessment processes can be estimated based on the full data likelihood, using failure time models for the loss to follow-up time and intensity-based models for the recurrent assessment process fitted to the validation data, where $\delta_i$ indicates random drop-out for individual $i$; details are provided in Appendix B of the Supplementary material available at Biostatistics online.
When a right censoring time is not recorded, a simplified multistate model can be considered in which the event and intermittent assessment processes are jointly considered, as shown in Figure 1(b); a vertical transition reflects event occurrence and the horizontal transitions to the right reflect the occurrence of the next assessment.

Imputation and inverse probability weighting for estimation
Let $t_0$ denote the landmark time at which event status is of interest. The binary status indicator $Y = I(T > t_0)$ indicates that an individual is event-free at $t_0$, and we let $\hat{Y}(X; \hat{\theta})$ denote a prediction for $Y$ based on a model for $Y \mid X$ indexed by $\theta$. To examine the accuracy of such a predictive model with a binary response, traditional methods often involve a summary statistic reflecting overall predictive performance, such as the mean squared error (Efron, 1983), or a summary statistic for discrimination ability, such as the area under a ROC curve (Hanley and McNeil, 1982). We first consider the prediction error based on an absolute error loss function, defined as
$$\mathrm{err}(t_0; \hat{\theta}) = E\{|Y - \hat{Y}(X; \hat{\theta})|\}, \qquad (3.1)$$
where $\hat{\theta}$ is an estimate from a training data set. The optimal predictor is the one that minimizes the prediction error, which for (3.1) is the conditional probability of remaining event-free to $t_0$ given the covariate, $\hat{Y}(X; \hat{\theta}) = P(T > t_0 \mid X; \hat{\theta})$. If we focus on predictors with the same support as $Y$, then we typically use $\hat{Y}(X; \hat{\theta}) = I\{P(T > t_0 \mid X; \hat{\theta}) > c\}$, which is the optimal binary predictor when the threshold $c = 0.5$ is used; we focus on the binary predictor from here on as this is typically of interest to medical researchers and has been the primary focus for predictions with right-censored data (Henderson, 1995; Graf and others, 1999; Bair and Tibshirani, 2004).
[Figure 2 caption: A schematic diagram enumerating possible combinations of $(Y, \Delta)$; the solid lines denote observations in which the event status is known at $t_0$, the dashed lines denote individuals whose event status cannot be classified and hence are excluded from the sum in (3.3), and the solid dots denote the (unobserved) exact event times.]
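As a concrete illustration of the thresholding step, predicted survival probabilities from a fitted hazard model can be converted to the binary predictor $\hat{Y}$. The sketch below assumes a Weibull proportional hazards form with illustrative parameter values; the function names and numbers are ours, not quantities from the paper.

```python
import numpy as np

def surv_prob(t0, X, lam, kappa, beta):
    """P(T > t0 | X) under an assumed Weibull proportional hazards model
    with cumulative hazard H(t | X) = (lam * t)**kappa * exp(X @ beta)."""
    return np.exp(-((lam * t0) ** kappa) * np.exp(X @ beta))

def binary_predictor(t0, X, lam, kappa, beta, c=0.5):
    """Yhat = I(P(T > t0 | X) > c): predict event-free status at t0."""
    return (surv_prob(t0, X, lam, kappa, beta) > c).astype(int)

# two hypothetical subjects; beta values mirror the simulation settings
X = np.array([[0.0, 0.0], [2.0, 1.0]])
beta = np.array([np.log(2), np.log(1.5)])
yhat = binary_predictor(1.0, X, lam=0.7, kappa=1.25, beta=beta)
# yhat -> [1, 0]: the low-risk subject is predicted event-free at t0 = 1
```

With $c = 0.5$ this reproduces the optimal binary rule described above; other thresholds trace out the ROC curve discussed in Section 4.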
While $\hat{Y}$ can typically be obtained from a prediction model as discussed in Section 2, the response $Y$ may of course be unknown due to interval censoring. Let $\Delta = I(Y\ \mathrm{is\ known})$ indicate whether the event status at $t_0$ can be determined. Figure 2 shows all possible combinations of the event status and observation status indicators $(Y, \Delta)$, where the event times for individuals A-D are interval-censored and those for E-G are right-censored. Either imputation or weighting is typically used to deal with the incomplete information in the validation sample. A straightforward imputation-based estimator of the prediction error based on a sample of $n$ independent observations is of the form
$$\widetilde{\mathrm{err}}(t_0) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \Delta_i\, |Y_i - \hat{Y}_i| + (1 - \Delta_i)\, \hat{E}\big(|Y_i - \hat{Y}_i| \,\big|\, X_i,\ a_i^- < T_i \le a_i^+\big) \right\}, \qquad (3.2)$$
where $a^-$ and $a^+$ are the left and right endpoints of the censoring interval containing the event time. The contributions from individuals with $\Delta_i = 0$ are based on the prediction model, so the performance of this estimator depends on correct specification of the response model and consistent parameter estimation.
To address this, we consider the multistate formulation of the joint observation-event model shown in Figure 1 to construct suitable inverse probability weights when attention is restricted to individuals in a validation sample who can be definitively classified at the landmark time $t_0$. Note that there are two types of individuals with $\Delta = 1$. The first type is an individual who is known to have failed before $t_0$ because a post-event assessment was made before $t_0$, so that $\bar{A}^+(t_0) > 0$; individual A in Figure 2 is an example of such an individual. The second type is an individual known to be event-free at $t_0$ by virtue of an event-free assessment made after $t_0$, for whom $\bar{A}^+(t_0) = 0$; individuals D and G in Figure 2 are examples of such individuals. Individuals B and C, for whom $Y = 1$ and $0$ respectively, have censoring intervals spanning $t_0$, so both have $\Delta = 0$, while individuals E and F are right-censored before $t_0$ and also have $\Delta = 0$.
With a validation sample $\mathcal{V}$ of $n$ independent individuals, and restricting attention to those for whom $\Delta = 1$, we can define the IPW estimator of the prediction error as
$$\widehat{\mathrm{err}}(t_0) = \frac{1}{n} \sum_{i=1}^{n} \frac{\Delta_i}{\hat{\pi}_i}\, |Y_i - \hat{Y}_i|, \qquad (3.3)$$
where the probability $\pi_i$ in the weight is the conditional expectation of $\Delta_i$ given $(Y_i, X_i)$, or specifically
$$\pi_i = E(\Delta_i \mid Y_i, X_i) = P(\Delta_i = 1 \mid Y_i, X_i). \qquad (3.4)$$
Here the distribution of the random variable $\Delta_i$ is governed by the loss to follow-up, assessment, and event processes, and the weight can be computed according to whether $Y_i = 0$ or $Y_i = 1$, as described below. This IPW estimate of the prediction error is based on the biased validation sub-sample with $\Delta = 1$, but through the weighting in (3.3) we obtain a consistent estimator of the prediction error if the joint model is correctly specified. In what follows, we describe how such a joint model can be formed; we denote the intensities for the loss to follow-up and assessment processes by $\lambda_c(\cdot)$ and $\lambda_a(\cdot)$, respectively.
Expectations when $T \le t_0$ is known: $Y = 0$, $\Delta = 1$. After the occurrence of the event of interest, represented by a $(0, k_A, 0) \to (0, k_A, 1)$ transition in Figure 1(a), the next event to occur can be a visit, corresponding to a $(0, k_A, 1) \to (0, k_A+1, 1)$ transition, or loss to follow-up, corresponding to a $(0, k_A, 1) \to (1, k_A, 1)$ transition. For $Y$ to be known in this case, a $(0, k_A, 1) \to (0, k_A+1, 1)$ transition must be made before $t_0$; the resulting expression for $P(\Delta = 1 \mid Y = 0, X)$ is given in (3.5).
Expectations when $T > t_0$ is known: $Y = 1$, $\Delta = 1$. If $Y$ is known to be one, an event-free assessment is required after $t_0$, in which case $\bar{A}^+(t_0) = 0$. Therefore, a $(0, k_A, 0) \to (0, k_A+1, 0)$ transition must be made after $t_0$, rather than a $(0, k_A, 0) \to (0, k_A, 1)$ transition corresponding to event occurrence or a $(0, k_A, 0) \to (1, k_A, 0)$ transition corresponding to loss to follow-up; the resulting expression for $P(\Delta = 1 \mid Y = 1, X)$ is given in (3.6). Estimates $\hat{\lambda}_a(\cdot)$ and $\hat{\lambda}_c(\cdot)$ of the observation process intensities are required to estimate the weights $\hat{\pi}_i$ based on (3.5) and (3.6), which enables construction of (3.3).
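To make the weighting in (3.3) concrete, a minimal numerical sketch is given below; the function name and the toy values of $\Delta_i$ and $\hat{\pi}_i$ are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def ipw_prediction_error(y, yhat, delta, pi_hat):
    """IPW estimate of E|Y - Yhat| as in (3.3): average the absolute
    loss over definitively classified subjects (delta = 1), weighting
    each by 1 / pi_hat; y may hold any placeholder where delta = 0
    since those terms are zeroed out by delta."""
    y, yhat, delta, pi_hat = map(np.asarray, (y, yhat, delta, pi_hat))
    return np.mean(delta / pi_hat * np.abs(y - yhat))

# toy validation sample: four classified subjects, two unclassified
y      = np.array([1, 0, 1, 0, 0, 0])   # status unknown for the last two
yhat   = np.array([1, 1, 0, 0, 1, 0])
delta  = np.array([1, 1, 1, 1, 0, 0])
pi_hat = np.array([0.8, 0.5, 0.8, 0.5, 0.5, 0.5])
err = ipw_prediction_error(y, yhat, delta, pi_hat)
```

Note that the divisor is the full sample size $n$, not the number of classified subjects; the weights $1/\hat{\pi}_i$ re-inflate the classified subjects to represent the whole validation sample.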
In the simplified setting where there is no loss to follow-up, expressions (3.5) and (3.6) reduce to simpler forms, given in (3.7) and (3.8) respectively, which involve only the assessment process intensity. The corresponding multistate representation of this simplified process is given in Figure 1(b).
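Under the further simplifying assumptions of a time-homogeneous Poisson visit process with rate $\lambda_a$ and no loss to follow-up, memorylessness suggests a simple closed form for the classification probability: for a subject with $T \le t_0$, classification requires at least one visit in $(T, t_0]$, while for $T > t_0$ it requires at least one visit in $(t_0, T)$. The sketch below encodes this; it is our illustrative reading of the simplified setting, not expressions (3.7)-(3.8) themselves.

```python
import numpy as np

def pi_classified(T, t0, rate):
    """P(Delta = 1 | T) assuming visits follow a homogeneous Poisson
    process with the given rate and there is no loss to follow-up.
    For T <= t0 classification needs >= 1 visit in (T, t0]; for T > t0
    it needs >= 1 visit in (t0, T). By memorylessness both events have
    probability 1 - exp(-rate * |T - t0|)."""
    T = np.asarray(T, dtype=float)
    return 1.0 - np.exp(-rate * np.abs(T - t0))
```

For example, a subject failing at $T = 0.5$ with $t_0 = 1$ and rate 2 has classification probability $1 - e^{-1} \approx 0.63$; subjects whose event time lies close to $t_0$ have small $\pi$ and hence receive large weights in (3.3).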

An AIPW estimator
The AIPW estimator (Robins and others, 1994) of the prediction error is defined here as
$$\widehat{\mathrm{err}}_{A}(t_0) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{\Delta_i}{\hat{\pi}_i}\, |Y_i - \hat{Y}_i| + \left(1 - \frac{\Delta_i}{\hat{\pi}_i}\right) \hat{E}\big(|Y_i - \hat{Y}_i| \,\big|\, X_i\big) \right\}. \qquad (3.9)$$
In typical applications, augmented inverse probability weighted estimators have a so-called "double-robustness" property, which states that if either the weight model or the event (prediction) model is correctly specified then a consistent estimator of the parameter of interest is obtained. In the present setting, however, the weight depends on the event model, so if the event model is incorrect then the weight model must also be incorrect. The double-robustness property is therefore not available here; see Section 7 for further comments.
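The structure of the AIPW estimator can be sketched numerically as follows; the helper name and the toy imputed losses `mu_hat` (standing in for the model-based terms $\hat{E}(|Y_i - \hat{Y}_i| \mid X_i)$) are illustrative assumptions.

```python
import numpy as np

def aipw_prediction_error(y, yhat, delta, pi_hat, mu_hat):
    """AIPW estimate of the prediction error: the IPW term plus an
    augmentation that substitutes the model-based imputed loss mu_hat
    with weight (1 - delta / pi_hat). When the weights are exact, the
    augmentation has mean zero; when the event model is correct, the
    augmentation corrects for misspecified weights."""
    y, yhat, delta, pi_hat, mu_hat = map(
        np.asarray, (y, yhat, delta, pi_hat, mu_hat))
    w = delta / pi_hat
    return np.mean(w * np.abs(y - yhat) + (1.0 - w) * mu_hat)

# tiny example: one classified subject, one unclassified subject
y      = np.array([1.0, 0.0])
yhat   = np.array([1.0, 1.0])
delta  = np.array([1.0, 0.0])
pi_hat = np.array([0.5, 0.5])
mu_hat = np.array([0.2, 0.4])   # hypothetical imputed losses
err_aipw = aipw_prediction_error(y, yhat, delta, pi_hat, mu_hat)
```

Classified subjects contribute a mixture of observed and imputed loss (with a possibly negative augmentation weight), while unclassified subjects contribute only the imputed loss, which is the mechanism behind the efficiency gains reported in Section 5.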
There is nevertheless merit in investigating the empirical bias (EBIAS) and relative efficiency of the estimator in (3.9), and we do so in what follows. Note that the AIPW method yields a consistent estimator of the prediction error provided the event model is correctly specified, even if the assessment model is not.

ROC CURVES AND THE AREA UNDER THE CURVE
The ROC curve is widely used in health-related studies to examine the predictive performance of a classification of patients' status. It plots the true positive rate (TPR) against the false positive rate (FPR) for a family of binary predictors and may help one select a predictor with a desired accuracy. Consider a set of binary predictors of survival status at a specific time $t_0$, $\hat{Y}(X; \hat{\theta}) = I\{F(t_0 \mid X; \hat{\theta}) > c\}$, where $c \in (0, 1)$. The TPR and FPR are defined as
$$\mathrm{TPR}(c) = P\{F(t_0 \mid X; \hat{\theta}) > c \mid T \le t_0\}, \qquad (4.10)$$
$$\mathrm{FPR}(c) = P\{F(t_0 \mid X; \hat{\theta}) > c \mid T > t_0\}, \qquad (4.11)$$
and the ROC curve is obtained by plotting $\mathrm{TPR}(c)$ against $\mathrm{FPR}(c)$ as $c$ ranges from 0 to 1. We can estimate the TPR and FPR in (4.10) and (4.11) by using inverse weighting methods to estimate the component probabilities; for example, the joint probability $P\{F(t_0 \mid X; \hat{\theta}) > c, T > t_0\}$ appearing in (4.11) admits IPW and AIPW estimators constructed as in Section 3. The area under the curve (AUC) is a summary measure of the ROC curve taking values from 0 to 1, with 0.5 representing a noninformative classification procedure (Hanley and McNeil, 1982); the closer the AUC is to 1, the greater the discriminative power of the predictive score. An alternative probabilistic interpretation of the AUC is apparent from the definition
$$\mathrm{AUC} = P\{F(t_0 \mid X_i; \hat{\theta}) < F(t_0 \mid X_j; \hat{\theta}) \mid T_i > t_0,\ T_j \le t_0\}, \qquad (4.12)$$
which is the probability that a randomly selected individual $i$ who is event-free (i.e., for whom $T_i > t_0$) has a higher survival probability than a randomly selected individual $j$ who has experienced the event (i.e., for whom $T_j \le t_0$) (Heagerty and Zheng, 2005; Uno and others, 2011). As in the case of right-censored data, the ROC curve and the AUC statistic can be estimated by imputation, inverse probability weighting, or augmented IPW methods. The estimators of the AUC can be written in the general form
$$\widehat{\mathrm{AUC}} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} I\{F(t_0 \mid X_i; \hat{\theta}) < F(t_0 \mid X_j; \hat{\theta})\}\, \tilde{Y}_i\, (1 - \tilde{Y}_j)}{\sum_{i=1}^{n}\sum_{j=1}^{n} \tilde{Y}_i\, (1 - \tilde{Y}_j)}, \qquad (4.13)$$
where method-specific quantities $\tilde{Y}_i$ and $1 - \tilde{Y}_j$ stand in for $Y_i$ and $1 - Y_j$. An imputation-based estimator of the AUC is defined as in (4.13), but with model-based imputed values used for $Y_i$ and $1 - Y_j$ when $\Delta = 0$; the modeling scheme for the construction of the weights is as discussed earlier.
The IPW estimator of the AUC corresponds to the use of $\Delta_i Y_i / \hat{\pi}_i$ and $\Delta_j (1 - Y_j) / \hat{\pi}_j$ in place of $Y_i$ and $1 - Y_j$, respectively, whereas the AIPW estimator of the AUC augments these weighted quantities with model-based imputation terms for individuals with $\Delta = 0$, in the same fashion as (3.9).
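The pairwise form of (4.13) with the IPW plug-ins can be sketched as follows; `weighted_auc` is our hypothetical helper, and the toy scores and weights are illustrative.

```python
import numpy as np

def weighted_auc(score, y_w, ybar_w):
    """General form (4.13): sum over ordered pairs (i, j) of
    I(score_j > score_i) * y_w[i] * ybar_w[j], normalized by the sum
    of y_w[i] * ybar_w[j]. Here score is the predicted failure
    probability F(t0 | X), y_w plays the role of Y_i (event-free) and
    ybar_w the role of 1 - Y_j (failed)."""
    score = np.asarray(score, dtype=float)
    num = den = 0.0
    n = len(score)
    for i in range(n):
        for j in range(n):
            pair_w = y_w[i] * ybar_w[j]
            num += (score[j] > score[i]) * pair_w
            den += pair_w
    return num / den

# IPW plug-ins: delta * Y / pi for Y_i and delta * (1 - Y) / pi for 1 - Y_j
score = np.array([0.1, 0.2, 0.8, 0.9])   # predicted failure probabilities
y     = np.array([1, 1, 0, 0])           # event-free indicators (all known)
delta = np.array([1, 1, 1, 1])
pi    = np.array([0.5, 0.5, 0.5, 0.5])
auc = weighted_auc(score, delta * y / pi, delta * (1 - y) / pi)
# perfectly separated scores give auc = 1.0
```

Because the weights enter both numerator and denominator, any common scaling of the weights cancels, which makes the estimator insensitive to the overall level of $\hat{\pi}$.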

Design and results of simulation studies for a Poisson assessment process
We next investigate the properties of these estimators of the performance of a prediction model based on case $K$ interval-censored data in a validation sample; the training and validation samples are generated using an identical set-up. We consider the setting where the response is a failure time and there are three covariates, $X_{i1}$, $X_{i2}$, and $X_{i3}$. In Scenario I, we take them to have marginal standard normal distributions with $X_{i1} \perp X_{i2}$, $X_{i1} \perp X_{i3}$, and $\mathrm{corr}(X_{i2}, X_{i3}) = 0$ or $0.5$. In Scenario II, the covariates are binary with $P(X_{ij} = 1) = 0.5$, $j = 1, 2, 3$. In both scenarios, given $(X_{i1}, X_{i2})$ the event time $T_i$ follows a Weibull distribution with a proportional hazards formulation, with hazard
$$\lambda(t \mid X_{i1}, X_{i2}; \theta) = \lambda \kappa (\lambda t)^{\kappa - 1} \exp(\beta_1 X_{i1} + \beta_2 X_{i2}),$$
where $\theta = (\lambda, \kappa, \beta_1, \beta_2)'$; we set $\beta_1 = \log 2$, $\beta_2 = \log 1.5$, and $\kappa = 1.25$, and determine the value of $\lambda$ so that the marginal survival probability $P(T > 1) = E\{1 - F(1 \mid X_{i1}, X_{i2}; \theta)\} = 0.5$. We let $B$ denote the end of an administrative observation period and choose $B$ such that the marginal distribution function satisfies $\mathcal{F}(B) = E\{F(B \mid X_{i1}, X_{i2}; \theta)\} = 0.9$. The assessment process is taken to be a time-homogeneous Poisson process with rate
$$\lambda_a(s \mid X_{i1}, X_{i3}; \gamma) = \exp(\gamma_0 + \gamma_1 X_{i1} + \gamma_2 X_{i3}),$$
where $\gamma_1 = \log(1.1)$ and $\gamma_2 = \log(1.5)$ in Scenario I for the normal covariates, $\gamma_1 = \log(2)$ and $\gamma_2 = \log(2.5)$ in Scenario II for the binary covariates, and $\gamma_0$ is set to ensure that the average number of assessments over $(0, B)$ is controlled at $\mu = E\{\int_0^B \lambda_a(s \mid X_{i1}, X_{i3}; \gamma)\,ds\} = 10$. The weights are estimated by modeling the event and assessment processes using (3.7) and (3.8). A pair of training and validation data sets, each with a sample size of $m = 500$, is generated accordingly for both Scenarios I and II, and we generate 100 replicates ($\mathrm{nsim} = 100$). For each simulated training data set, a Weibull regression model was fitted for the event time under the assumption of a conditionally independent visit process (Cook and Lawless, 2018).
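The data-generating steps above can be sketched as follows. The value of $\gamma_0$, the mutual independence of all three covariates, and the function name are simplifying assumptions made for illustration; the $\beta$, $\kappa$, and $\gamma_1$, $\gamma_2$ values mirror Scenario I.

```python
import numpy as np

rng = np.random.default_rng(1)

def sim_subject(lam=0.7, kappa=1.25, beta=(np.log(2), np.log(1.5)),
                gamma=(np.log(10.0 / 3.0), np.log(1.1), np.log(1.5)), B=3.0):
    """One subject: standard normal covariates (taken mutually independent
    here for simplicity), a Weibull proportional hazards event time, and a
    time-homogeneous Poisson assessment process over (0, B]."""
    x1, x2, x3 = rng.standard_normal(3)
    # Weibull PH: H(t | x) = (lam * t)**kappa * exp(b1*x1 + b2*x2);
    # inverting H at a unit-rate exponential draw yields T
    u = rng.exponential()
    T = (u / np.exp(beta[0] * x1 + beta[1] * x2)) ** (1.0 / kappa) / lam
    # visit process: Poisson count over (0, B], then uniform order statistics
    rate = np.exp(gamma[0] + gamma[1] * x1 + gamma[2] * x3)
    visits = np.sort(rng.uniform(0.0, B, rng.poisson(rate * B)))
    return T, visits

T, visits = sim_subject()
```

The observed data for a subject are then the visit times together with the event status at each visit, from which the censoring interval bracketing $T$ is read off.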
Parametric and semiparametric models were fitted to the assessment process using a time-homogeneous Poisson regression model and a semiparametric Andersen-Gill (AG) model (Andersen and Gill, 1982), respectively, applied to the validation sample data. The unweighted and imputation-based estimators depend only on the model for the event process, while the IPW and AIPW estimators depend on both the event and observation processes. The corresponding estimators are denoted IPW-TH and AIPW-TH when the time-homogeneous Poisson regression model is used for the visit process, and IPW-AG and AIPW-AG when the semiparametric AG model is used.
The EBIAS and empirical standard error (ESE) of the estimators of the prediction error at time $t_0$ are summarized in the first column of results ($\eta = 1$) in Table 1, where the value of $t_0$ is taken to be the median of the marginal distribution of $T$. The biases of the unweighted estimators are large, as expected, while the imputation-based estimators perform well since the correct response model is fitted. Under correct specification of the assessment model, the proposed IPW and AIPW estimators have relatively small biases compared to the unweighted estimators, while their variability, as reflected by the ESE, is greater than that of the imputation-based estimator. Note that the AIPW estimators are more efficient than the IPW estimators here, but the AIPW estimator relies on correct specification of the prediction model.
In Table 2, we report the empirical properties of the estimators of the AUC with a similar layout of the results. The column of results for the Poisson visit process ($\eta = 1$) shows a similar pattern of behavior to the analogous column of Table 1 with one exception: the ESEs of the estimates based on the AIPW approach (under either the time-homogeneous or the semiparametric visit process model) are not smaller than those of the estimators based on the IPW approach.

Design and results of studies for renewal assessment processes
We next investigate the impact of misspecification of the assessment model. We consider the scenario in which the assessment process is governed by a non-Markov renewal process but the analysis of the assessment process is still based on a parametric or semiparametric Poisson process. The event times are generated as described in Section 5.1, but here the gap times between consecutive assessments are Gamma distributed with shape $\eta$ and rate $\exp(\gamma_0 + \gamma_1 X_{i1} + \gamma_2 X_{i3})$, where $\eta = 1.25$ and $2$; the values of the other parameters are the same as in Section 5.1 for both the normal and binary covariates. For each parameter setting, 100 data sets with sample sizes of $m = 500$ are simulated. The empirical performance of the estimators of the prediction error under this setting is summarized in the second two sets of columns of Table 1 ($\eta = 1.25$ and $2$). Since the assessment process is non-Markov, the further the shape parameter $\eta$ is from 1, the greater the departure from the time-homogeneous Poisson process, and hence the greater the extent of misspecification. We find that the EBIASes of the IPW-TH and IPW-AG estimators increase as $\eta$ increases, but the AIPW-TH and AIPW-AG estimators maintain relatively small bias. The ESEs of the AIPW estimators are smaller than those of the IPW estimators, which again demonstrates the improved efficiency of the AIPW estimators of the prediction error. The EBIAS and empirical standard error (ESE) of the unweighted, imputation-based, IPW, and AIPW estimators of the AUC at the landmark time $t_0$ are summarized in the last two sets of columns of Table 2 for both scenarios, where similar findings can be seen. In broad terms, we found that when the joint model was misspecified there was a consequent bias in the IPW estimators, but this bias is smaller than that of the unweighted estimator for the misspecifications considered here.
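A gap-time generator for the renewal visit process might look like the following sketch; setting $\eta = 1$ recovers the homogeneous Poisson special case of Section 5.1. Note that NumPy's gamma sampler is parameterized by scale, so the rate must be inverted.

```python
import numpy as np

rng = np.random.default_rng(7)

def renewal_visits(eta, rate, B):
    """Visit times from a renewal process on (0, B]: gap times between
    consecutive visits are Gamma(shape=eta, rate=rate). The mean gap is
    eta / rate, so eta != 1 changes the gap-time shape (and breaks the
    Markov property) while the rate controls the visit frequency."""
    times, t = [], 0.0
    while True:
        t += rng.gamma(eta, 1.0 / rate)   # numpy uses scale = 1 / rate
        if t > B:
            return np.array(times)
        times.append(t)

visits = renewal_visits(eta=2.0, rate=4.0, B=5.0)
```

Fitting a Poisson process model to such data misstates the conditional visit intensity, which is the mechanism behind the bias of the IPW estimators reported above.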
When the event model is correctly specified, but the assessment model is misspecified, the AIPW estimators have a comparable performance with those under correct specification of the assessment model; the bias remains small and the standard errors are modest.

PREDICTION OF ARTHRITIS MUTILANS IN PSORIATIC ARTHRITIS
The University of Toronto Psoriatic Arthritis Clinic registry was launched in 1977 to study this complex disease (Gladman and others, 2008; Gladman and Chandran, 2011). Patients in this registry undergo a detailed clinical and radiological examination upon entry to the clinic and provide serum samples for genetic testing. Follow-up clinical and radiological assessments are scheduled annually and biannually, respectively, in order to track changes in joint damage. At each radiological assessment the degree of damage is recorded in 64 joints on a five-point scale. To date, 1495 patients have been recruited to the registry, and 1185 of these have undergone genetic testing to determine their human leukocyte antigen (HLA) profile. The genotypes of HLA-A, HLA-B, and HLA-C alleles were collected, and a total of 70 HLA markers were identified as of interest a priori; 15 of these markers had a frequency in the sample of less than 1% and so were excluded from further consideration. While there is no clinical consensus on how precisely to define arthritis mutilans, it represents a state of significant joint damage arising from an extreme form of the disease; here, we define it as present if an individual has 5 or more joints with the advanced stage of damage according to the modified Steinbrocker score (Rahman and others, 1998). We consider a data set containing 1015 patients from the University of Toronto Psoriatic Arthritis Clinic, with a median time from the diagnosis of psoriatic arthritis to the last radiological assessment of 12.0 years (lower quartile = 5.1, upper quartile = 22.5). A plot of the Nelson-Aalen estimate of the mean function for the number of radiological assessments is given in Figure 3(a). Among the 1015 patients, a total of 874 (86.1%) were not observed to develop arthritis mutilans and hence provided right-censored times, whereas 141 (13.9%) were known to develop arthritis mutilans, yielding interval-censored times since they had a visit with a damaged joint count of 5 or greater. The 25th, 50th, and 75th percentiles of the censoring interval lengths for the arthritis mutilans patients were 2.6, 7.9, and 15.0 years, respectively. Figure 3(b) contains a nonparametric estimate and pointwise 95% confidence bands (Turnbull, 1976) for the cumulative distribution function of the time from onset of psoriatic arthritis to arthritis mutilans.
[Displaced caption: a training sample of 508 randomly selected subjects is used to develop the prediction model, the remaining 507 subjects form the test data used to evaluate predictive performance, and the assessment model is fitted to the entire data set.]
The estimate reflects a steadily increasing risk with roughly 20% of psoriatic arthritis patients developing the condition within 20 years of disease onset.
Our interests lie in identifying which among the 55 HLA markers are associated with the development of arthritis mutilans from the onset of psoriatic arthritis, and assessing the predictive performance of these models.
We randomly partition the 1015 patients into training ($n_0 = 508$) and validation ($n = 507$) samples. Using the training sample, we adopt a proportional hazards model with a Weibull baseline hazard to develop the prediction model for the time from onset of psoriatic arthritis to arthritis mutilans. All models controlled for four demographic variables: age at clinic entry, sex, family history of psoriasis, and family history of psoriatic arthritis. A semiparametric analysis is carried out to model the inspection process; that is, the gap times between consecutive inspection times are modeled using an AG model. The covariates in the inspection process model are the four demographic variables and the set of 19 HLA markers.
Having fitted these models, we next apply the inverse weighting approach to estimate the prediction errors and to assess the discriminative abilities of the models in the validation sample. Figure 3(c) shows the resulting prediction error curves based on the unweighted, IPW, and AIPW estimators with $t_0$ ranging from 0 to 40 years after the diagnosis of psoriatic arthritis. The unweighted estimates are greater than the weighted estimates since the unweighted estimators do not account for the unclassified portion of the sample. Figure 3(d) shows the estimates of the area under the ROC curve as a function of $t_0$. The ROC curves at $t_0 = 10$ and 20 years after diagnosis of psoriatic arthritis for both models, with IPW and AIPW estimators, are shown in Figure S1 of the Supplementary material available at Biostatistics online; the corresponding estimates of the AUCs are also given in the legend. We conclude from these figures that the model including HLA markers has better predictive performance, in terms of a higher AUC and lower prediction error, than the model including the demographic variables only.

DISCUSSION AND FUTURE RESEARCH
In this article, we described imputation techniques and the use of inverse probability weighting to estimate predictive performance in settings where validation samples involve case $K$ interval-censored data arising from intermittent assessment of individuals. The simulation studies demonstrated that the proposed imputation-based, IPW, and AIPW estimators perform better than the simple unweighted estimators. For many data sets involving interval censoring, only the left and right endpoints of the censoring intervals are reported; such information is insufficient to model the inspection process and implement the proposed methods. Implementation hinges critically on the availability of all inspection times over the course of observation (i.e., case $K$ interval censoring). In the motivating study such data are available, and the proposed methods were illustrated by an application in which a semiparametric analysis was carried out to model the inspection process and a parametric proportional hazards model with a Weibull baseline hazard function was used to model the event time. More flexible approaches could be considered for model fitting, including the use of models with piecewise-constant baseline hazards or penalized regression to select the prediction model as in Wu and Cook (2015).
The expressions for the estimated weights in Sections 3.1 and 3.2 are based on the joint model for the response and observation processes, whereas the imputation-based approach estimates the conditional probability of failure by the landmark time $t_0$ given the respective censoring interval. A hybrid approach was suggested by a referee for the setting in which the loss to follow-up time is reported. Here, attention could be restricted to individuals who have not been lost to follow-up before the landmark time.
An imputation-based approach could then be adopted among these individuals with inverse probability of censoring weights (Graf and others, 1999;Hothorn and others, 2006;Gerds and Schumacher, 2006;Lawless and Yuan, 2010) used to address the fact that this imputation-based estimator is applied to a biased sub-sample of individuals. A study of the robustness and efficiency of this approach compared to the methods presented here would be of interest. Wu and Cook (2018) develop methods for variable selection with truncated and interval-censored data. While we have dealt with the latter complication here, it is less clear how one might assess predictive accuracy when samples are chosen subject to truncation, but this feature is often present in the development of predictive models based on data acquired from different sources with distinct selection criteria. This is a topic of ongoing research.
The availability of external validation data is highly desirable for assessing predictive performance. There are several clinical registries of individuals with psoriatic arthritis in Spain (Queiro and others, 2003), Ireland (Winchester and others, 2012), and Newfoundland (Rahman and others, 2011), most of which are devoted to some form of genetic research aiming to identify prognostic markers. We may consider the use of such external validation data sets and are currently investigating the extent of follow-up in these registries. An issue with the Newfoundland cohort is that the distribution of genetic markers and other attributes among the members of this registry differs from that of individuals in the University of Toronto Psoriatic Arthritis Clinic Registry. As a result, we might expect the estimated prediction error based on such an external validation sample to be quite different from the estimates obtained by cross-validation within the Toronto registry. Approaches for calibrating the covariate distribution to obtain a more realistic expectation of the predictive accuracy in this setting are also of interest, but these will require detailed information on the Newfoundland cohort.

SOFTWARE
Software in the form of R code, together with a sample input data set is available on GitHub and can be requested from the corresponding author (ywu@nankai.edu.cn).

SUPPLEMENTARY MATERIAL
Supplementary material is available online at http://biostatistics.oxfordjournals.org.