On the relation between initial value and slope

Suppose measurements of a particular feature are collected at baseline and at a number of subsequent time points and that for each individual there is a roughly linear trend in time. This paper takes three approaches to testing whether there is a relation between the initial value and the slope. It also considers whether the initial value for an individual is a useful predictor of the slope for that individual. The problems are formulated in terms of regression models with random coefficients. The solutions are illustrated using data from an observational study of clinical correlates of disability and progression in Huntington's disease.


INTRODUCTION
Suppose that patients are measured for a particular feature at baseline (t = 0) and at a number of subsequent time points and that for each patient there is a roughly linear trend in time. Is there a relation between the initial value and the slope, and is the initial value for an individual a useful predictor of the slope for that individual? Any analysis must take account of the phenomenon of regression to the mean first noted by Galton (1869). For a recent account from a number of perspectives, see the series of papers edited by Senn (1997) in the issue of Statistical Methods in Medical Research devoted to this topic.
The present paper was motivated by a specific application to be described in Section 2. Three versions of the problem are given and formulated in terms of a regression model with random coefficients; see, for example, Laird and Ware (1982). A theoretical discussion more from first principles is outlined in Section 3 and the conclusions from the specific study are sketched in Section 4. Section 5 discusses some further developments.
The multilevel model approach to longitudinal data analysis is well known and our suggested solution is implicit in its formulation; see, for example, Burton et al. (1998), Omar et al. (1999), Verbeke and Molenberghs (2000), or Leyland and Goldstein (2001). Some care, however, is needed in applying the approach. For example, we are unaware of papers in the medical statistical literature where such a model has been correctly used to address the questions posed above. Even when a multilevel model is properly fitted, it is easy to overlook the effect of regression to the mean in interpretation.
In this paper we identify three different formulations of the issue and indicate solutions which avoid the regression to the mean effect.

A STUDY OF HUNTINGTON'S DISEASE
The data used to illustrate the above ideas were collected as part of an observational study of clinical correlates of disability and progression in Huntington's disease (Mahant et al., 2003). Patients with a definite diagnosis of Huntington's disease were assessed at their initial and subsequent routine clinic visits for motor, cognitive, and functional impairment using the Unified Huntington's Disease Rating Scale (UHDRS) (Huntington Study Group, 1996). We concentrate here on the total motor score (TMS) and the total functional capacity (TFC) of the UHDRS. The TMS can take integer values from 0 (normal) to 124 (maximally abnormal) while the TFC takes integer values between 0 (maximal disability) and 13 (normal).
Our study population consists of 83 patients who had more than one clinic visit and were followed up for at least 1 year at Westmead Hospital, a major Sydney teaching hospital and tertiary referral center. The median duration of follow-up was 5.2 years (interquartile range, 3.1-6.8 years). The median number of visits per patient was 9 (interquartile range, 5-12). The median baseline TMS and TFC scores were 38 (interquartile range, 26-55) and 8 (interquartile range, 5-10), respectively. Figure 1 illustrates typical TMS and TFC profiles for a random selection of 10 patients.

Formulation
For the $i$th patient ($i = 1, \ldots, k$), suppose observations $Y_{ij}$ ($j = 0, \ldots, r_i$) are available at times $t_{ij} \geq 0$, where $t_{i0} = 0$ defines the time origin for individual $i$ and is taken at the first observation. We take as an initial model

$$Y_{ij} = \mu_i + \gamma_i t_{ij} + \varepsilon_{ij}, \tag{3.1}$$

where the $\varepsilon_{ij}$ are uncorrelated random terms of zero mean and variance $\sigma_\varepsilon^2$, and $(\mu_i, \gamma_i)$ are, respectively, the expected value at $t_{ij} = 0$ and the slope for the $i$th individual. We write

$$\mu_i = \mu + \zeta_i, \qquad \gamma_i = \gamma + \eta_i, \tag{3.2}$$

where $(\zeta_i, \eta_i)$ are random terms of zero mean and $\mathrm{var}(\zeta_i) = \sigma_\zeta^2$, $\mathrm{var}(\eta_i) = \sigma_\eta^2$, $\mathrm{cov}(\zeta_i, \eta_i) = \rho_{\zeta\eta}\sigma_\zeta\sigma_\eta$.

We now distinguish different versions of the problem sketched in Section 1. In the first, we are interested in the relation between the expected value $\mu_i$ at time zero and the slope $\gamma_i$. This relation is summarized in $\rho_{\zeta\eta}$ or by the regression coefficient

$$\beta_{\gamma\mu} = \rho_{\zeta\eta}\,\sigma_\eta / \sigma_\zeta. \tag{3.3}$$

Especially if the $\varepsilon_{ij}$ represent primarily measurement error or uninteresting 'noise,' the dependence of interest may be best encapsulated in $\beta_{\gamma\mu}$. Note that the conditional variance of $\eta_i$ given $\zeta_i$ is $\sigma_\eta^2(1 - \rho_{\zeta\eta}^2)$.

A second possibility, directly relevant for empirical prediction, is to relate $\gamma_i$ to $Y_{i0}$. Under assumptions (3.1) and (3.2), the regression coefficient of $\gamma_i$ on $Y_{i0}$ is

$$\beta_{\gamma Y_0} = \frac{\rho_{\zeta\eta}\,\sigma_\zeta\sigma_\eta}{\sigma_\zeta^2 + \sigma_\varepsilon^2}. \tag{3.4}$$

A different approach to assessing dependence on the baseline value is to consider $Y_0$ purely as an explanatory variable.
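The difference between the two regression coefficients can be seen numerically. The following is a minimal sketch in Python, assuming that the measurement errors are uncorrelated with the random terms $(\zeta_i, \eta_i)$, so that $\mathrm{cov}(\gamma_i, Y_{i0}) = \rho_{\zeta\eta}\sigma_\zeta\sigma_\eta$ and $\mathrm{var}(Y_{i0}) = \sigma_\zeta^2 + \sigma_\varepsilon^2$. The function names and the parameter values are hypothetical and are not taken from the study.

```python
def beta_gamma_mu(rho, sigma_zeta, sigma_eta):
    """Regression coefficient of the slope gamma_i on the underlying initial mean mu_i."""
    return rho * sigma_eta / sigma_zeta


def beta_gamma_y0(rho, sigma_zeta, sigma_eta, sigma_eps):
    """Regression coefficient of the slope gamma_i on the observed baseline Y_i0.

    Uses cov(gamma_i, Y_i0) = rho * sigma_zeta * sigma_eta and
    var(Y_i0) = sigma_zeta**2 + sigma_eps**2.
    """
    return rho * sigma_zeta * sigma_eta / (sigma_zeta**2 + sigma_eps**2)


# Illustrative values only (assumptions, not estimates from the study).
rho, s_zeta, s_eta, s_eps = -0.3, 3.0, 0.4, 1.5

b_mu = beta_gamma_mu(rho, s_zeta, s_eta)
b_y0 = beta_gamma_y0(rho, s_zeta, s_eta, s_eps)

# Measurement error in the baseline shrinks the coefficient towards zero
# by the factor sigma_zeta^2 / (sigma_zeta^2 + sigma_eps^2).
attenuation = s_zeta**2 / (s_zeta**2 + s_eps**2)

print("beta_gamma_mu =", b_mu)
print("beta_gamma_Y0 =", b_y0)
print("attenuation   =", attenuation)
```

Under these assumptions the coefficient on the observed baseline is the coefficient on the underlying mean multiplied by the attenuation factor, so the two coincide only when $\sigma_\varepsilon = 0$.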

Statistical analysis
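The model (3.1)-(3.2) can be fitted by PROC MIXED in SAS or lme in R or S-PLUS, the defining parameters being estimated preferably by restricted maximum likelihood (REML). In particular, estimates of $\beta_{\gamma\mu}$ and $\beta_{\gamma Y_0}$ can be found and confidence intervals (CIs) obtained via the asymptotic standard errors.

We now give a discussion from first principles which has the advantage of showing explicitly the relation with regression to the mean and with the contributions of individual patients. For the analysis to study $\beta_{\gamma\mu}$, we start by fitting linear least squares regressions to the data from each patient. For the $i$th patient this gives $\hat\gamma_i$ and $\hat\mu_i = \bar{Y}_{i\cdot} - \hat\gamma_i \bar{t}_{i\cdot}$. Direct evaluation under the model (3.1)-(3.2) gives the covariance matrix of $(\hat\mu_i, \hat\gamma_i)$ as

$$\begin{pmatrix} \sigma_\zeta^2 + \sigma_\varepsilon^2\{1/(r_i + 1) + \bar{t}_{i\cdot}^{\,2}/S_i\} & \sigma_{\zeta\eta} - \sigma_\varepsilon^2\,\bar{t}_{i\cdot}/S_i \\ \sigma_{\zeta\eta} - \sigma_\varepsilon^2\,\bar{t}_{i\cdot}/S_i & \sigma_\eta^2 + \sigma_\varepsilon^2/S_i \end{pmatrix},$$

where $\sigma_{\zeta\eta} = \rho_{\zeta\eta}\sigma_\zeta\sigma_\eta$, $S_i = \sum_j (t_{ij} - \bar{t}_{i\cdot})^2$, and $\bar{Y}_{i\cdot}$, $\bar{t}_{i\cdot}$ are averages for the $i$th individual. Note that if $\sigma_{\zeta\eta} = 0$, that is, if baseline mean and slope are uncorrelated, the covariance of the estimates is negative, a manifestation of regression to the mean.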
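The regression to the mean effect in the per-patient estimates can be checked with a short calculation. The sketch below, in Python, combines the standard least squares variances for an intercept and slope with the random-coefficient variances of model (3.1)-(3.2). The function name, the visit times, and the parameter values are hypothetical choices for illustration, and the expressions are the textbook ones rather than a transcription of the authors' formula.

```python
def estimate_covariance(times, sigma_zeta, sigma_eta, rho, sigma_eps):
    """Covariance matrix of the least squares (intercept, slope) for one patient.

    times: observation times for the patient, starting at 0.
    Returns [[var(mu_hat), cov], [cov, var(gamma_hat)]].
    """
    n = len(times)
    t_bar = sum(times) / n
    s_tt = sum((t - t_bar) ** 2 for t in times)

    # Between-patient variation plus sampling error of the fitted line.
    var_mu = sigma_zeta**2 + sigma_eps**2 * (1.0 / n + t_bar**2 / s_tt)
    var_gamma = sigma_eta**2 + sigma_eps**2 / s_tt
    cov = rho * sigma_zeta * sigma_eta - sigma_eps**2 * t_bar / s_tt
    return [[var_mu, cov], [cov, var_gamma]]


# Hypothetical visit times (years) and parameter values; rho = 0 means the
# underlying initial mean and slope are uncorrelated.
times = [0.0, 0.5, 1.0, 2.0, 3.0]
cov_matrix = estimate_covariance(times, sigma_zeta=3.0, sigma_eta=0.4,
                                 rho=0.0, sigma_eps=1.5)

print("cov(mu_hat, gamma_hat) =", cov_matrix[0][1])
```

With $\rho_{\zeta\eta} = 0$ the off-diagonal term reduces to $-\sigma_\varepsilon^2 \bar{t}_{i\cdot}/S_i$, which is negative whenever the mean observation time is positive, so a naive plot of fitted slopes against fitted intercepts would suggest a spurious negative association.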

A pooled estimate of the parameter $\sigma_\varepsilon^2$ is obtained from the residual mean square after fitting separate regression lines for each patient. Various simple, if somewhat inefficient, estimates of the remaining parameters are now available. In particular, we may calculate the sums of squares and products about the mean of the individual $(\hat\mu_i, \hat\gamma_i)$ and equate these to their expectations. Writing $\bar{t}_{i\cdot}$ for the mean observation time of the $i$th individual, $S_i = \sum_j (t_{ij} - \bar{t}_{i\cdot})^2$ and $\sigma_{\zeta\eta} = \rho_{\zeta\eta}\sigma_\zeta\sigma_\eta$, this gives for the expectations of, respectively, the sum of squares of the $\hat\mu_i$, the $\hat\gamma_i$ and the sum of products of $\hat\mu_i$, $\hat\gamma_i$:

$$(k - 1)\Big[\sigma_\zeta^2 + \sigma_\varepsilon^2\, k^{-1} \sum_i \{1/(r_i + 1) + \bar{t}_{i\cdot}^{\,2}/S_i\}\Big],$$

$$(k - 1)\Big[\sigma_\eta^2 + \sigma_\varepsilon^2\, k^{-1} \sum_i 1/S_i\Big],$$

$$(k - 1)\Big[\sigma_{\zeta\eta} - \sigma_\varepsilon^2\, k^{-1} \sum_i \bar{t}_{i\cdot}/S_i\Big].$$

As noted above, the version of the problem in which the regression of slope on the observed baseline value is of interest can be approached via the above analysis, taking $\beta_{\gamma Y_0}$ in (3.4) as the primary parameter of interest. An alternative approach, which assumes less, is as follows. We use the baseline value as an explanatory variable, and hence to be treated conditionally when modeling $Y_{ij}$ for $t > 0$. This covers the possibility that the baseline value, while a predictor of slope, is not derived from the system (3.1)-(3.2). For example, when the baseline value is measured at diagnosis it may be subject to a bias that does not, however, affect its usefulness as a predictor of slope. Thus, we fit the linear regression models (3.8)-(3.10) to the data for $i = 1, \ldots, n_2$, $j = 1, \ldots, r_i$, where $n_2$ is the number of subjects with $r_i \geq 2$.

The random variable $\delta_i$ in the model $\gamma_i = \gamma + \beta(y_{i0} - \bar{y}_{\cdot 0}) + \delta_i$ represents the variation in slope ($t > 0$) not accounted for by linear regression on $y_{i0}$. The sum of squares for estimating $\sigma_\delta^2 = \mathrm{var}(\delta_i)$ is the difference between the sums of squares for fitting constants in models (3.8) and (3.9). The expected value of this difference under the model is given by (3.12). It is therefore possible to estimate $\sigma_\delta^2$ and $\sigma_\varepsilon^2$ and to test the significance of the $\sigma_\delta^2$ term. Additional explanatory variables could be included in the regression term.
One way to set out the calculations is in the form of an analysis of variance.
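The first-principles estimation of the variance components can be sketched as follows. This is a minimal Python illustration, assuming the standard least squares sampling variances for each patient's intercept and slope and using simulated data. The helper names (`fit_line`, `moment_estimates`, `simulate`) and all parameter values are hypothetical. The moment estimates are simple rather than efficient and can be negative in small samples.

```python
import random


def fit_line(times, ys):
    """Least squares line for one patient, with the summaries needed later."""
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(ys) / n
    s_tt = sum((t - t_bar) ** 2 for t in times)
    slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, ys)) / s_tt
    intercept = y_bar - slope * t_bar
    rss = sum((y - intercept - slope * t) ** 2 for t, y in zip(times, ys))
    return {"mu": intercept, "gamma": slope, "rss": rss,
            "n": n, "t_bar": t_bar, "s_tt": s_tt}


def moment_estimates(patients):
    """patients: list of (times, ys) pairs. Returns moment estimates."""
    fits = [fit_line(t, y) for t, y in patients]
    k = len(fits)

    # Pooled residual mean square: n - 2 degrees of freedom per patient.
    resid_df = sum(f["n"] - 2 for f in fits)
    s2_eps = sum(f["rss"] for f in fits) / resid_df

    mu_bar = sum(f["mu"] for f in fits) / k
    ga_bar = sum(f["gamma"] for f in fits) / k
    ss_mu = sum((f["mu"] - mu_bar) ** 2 for f in fits)
    ss_ga = sum((f["gamma"] - ga_bar) ** 2 for f in fits)
    sp = sum((f["mu"] - mu_bar) * (f["gamma"] - ga_bar) for f in fits)

    # Average multipliers of sigma_eps^2 in the sampling (co)variances.
    a_mu = sum(1.0 / f["n"] + f["t_bar"] ** 2 / f["s_tt"] for f in fits) / k
    a_ga = sum(1.0 / f["s_tt"] for f in fits) / k
    a_mg = sum(f["t_bar"] / f["s_tt"] for f in fits) / k

    # Equate mean squares and products to their expectations and solve.
    return {
        "sigma2_eps": s2_eps,
        "sigma2_zeta": ss_mu / (k - 1) - s2_eps * a_mu,
        "sigma2_eta": ss_ga / (k - 1) - s2_eps * a_ga,
        "sigma_zeta_eta": sp / (k - 1) + s2_eps * a_mg,
    }


def simulate(k, times, mu, gamma, s_zeta, s_eta, rho, s_eps, seed=0):
    """Simulate k patients from the random-coefficient model (illustrative)."""
    rng = random.Random(seed)
    patients = []
    for _ in range(k):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        zeta = s_zeta * z1
        eta = s_eta * (rho * z1 + (1 - rho**2) ** 0.5 * z2)
        ys = [mu + zeta + (gamma + eta) * t + rng.gauss(0, s_eps) for t in times]
        patients.append((times, ys))
    return patients


data = simulate(k=200, times=[0.0, 0.5, 1.0, 2.0, 3.0], mu=40.0, gamma=2.0,
                s_zeta=3.0, s_eta=0.4, rho=-0.3, s_eps=1.5)
est = moment_estimates(data)
print(est)
print("beta_gamma_mu estimate:", est["sigma_zeta_eta"] / est["sigma2_zeta"])
```

The correction term added to the sum of products is what removes the regression to the mean effect: without it, the raw covariance of fitted intercepts and slopes is biased downwards by the average of $\sigma_\varepsilon^2 \bar{t}_{i\cdot}/S_i$.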

RESULTS FOR HUNTINGTON'S DISEASE STUDY
If normal error structures are assumed in (3.1)-(3.2), R or S-PLUS can be used to fit linear mixed effects models to TMS and TFC. Because TFC takes only integer values over the narrow range [0, 13], a suitably transformed version, namely $\mathrm{TFC}_{\mathrm{trans}} = \log\{(0.5 + \mathrm{TFC})/(13.5 - \mathrm{TFC})\}$, was also considered. The resulting estimates of parameters in (3.1)-(3.4) and their associated 95% CIs obtained using REML are shown in Table 1. The maximum likelihood estimates are virtually the same. First consider the results for TMS. Since $\hat\rho_{\zeta\eta}$ is here 0.06 with a 95% CI (−0.23, 0.34), there is no evidence of association between the individual expected value $\mu_i$ at time zero and the slope $\gamma_i$. A plot of the residuals $\eta_i$ versus $\zeta_i$ for the fitted model (3.1)-(3.2) is shown in Figure 2(a). This plot suggests no obvious departures from the underlying assumptions and appears consistent with the independence of $\eta_i$ and $\zeta_i$ and hence of the intercept $\mu_i$ and slope $\gamma_i$.
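The transformation of TFC mentioned above can be written as a small function. This sketch assumes the transformation is the empirical logit $\log\{(0.5 + \mathrm{TFC})/(13.5 - \mathrm{TFC})\}$, which maps the bounded range [0, 13] onto a scale symmetric about zero. The function name `tfc_trans` is hypothetical.

```python
import math


def tfc_trans(tfc):
    """Empirical-logit style transform of a TFC score in [0, 13] (assumed form)."""
    if not 0 <= tfc <= 13:
        raise ValueError("TFC must lie between 0 and 13")
    # The offsets 0.5 and 13.5 keep the logarithm finite at both end points.
    return math.log((0.5 + tfc) / (13.5 - tfc))


# Transformed values for every possible integer score.
for score in range(14):
    print(score, round(tfc_trans(score), 3))
```

The offsets mean that scores of 0 and 13 map to finite values of equal magnitude and opposite sign, while the spacing between adjacent scores widens towards either end of the scale.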
Similarly, there is no evidence of association between the baseline value of TMS and its rate of change over subsequent visits since $\hat\beta_{\gamma Y_0}$ is 0.009, 95% CI (−0.019, 0.037). The analysis of variance associated with (3.8)-(3.10) together with (3.12) yields estimates of 6.48 and 1.81 for $\sigma_\varepsilon$ and $\sigma_\delta$, respectively. These are consistent with the earlier REML estimates for the system (3.1)-(3.2) ($t \geq 0$) given in Table 1. There the residual error estimate was 6.66, 95% CI (6.28, 7.07), and the estimated error for individual slopes was 1.68, 95% CI (1.26, 2.25). Figure 2(b) illustrates the individual TMS regression coefficients (for $t > 0$) plotted against the baseline value. This plot is consistent with the absence of appreciable association between these variables. Now consider the TFC scores. Here $\hat\rho_{\zeta\eta}$ is −0.37, 95% CI (−0.62, −0.04), and $\hat\beta_{\gamma Y_0}$ is −0.042, 95% CI (−0.074, −0.011). At first glance there seems to be reasonable evidence of a negative association between the initial value and the slope, that is, more rapid deterioration of TFC among those with better initial scores. On the transformed scale, however, $\hat\rho_{\zeta\eta}$ is −0.22, 95% CI (−0.52, 0.14), and $\hat\beta_{\gamma Y_0}$ is −0.022, 95% CI (−0.051, 0.006), both not quite significant at the 5% level. The normal probability Q-Q plots of the residuals $\varepsilon_{ij}$, $\zeta_i$, and $\eta_i$ obtained fitting model (3.1)-(3.2) to the raw and to the transformed TFC data are shown in Figure 3. There is some evidence of departure from normality which is not entirely removed by the transformation. This departure may be a result of 'floor and ceiling effects' due to the limited possible range of observations. Figure 4(a) is a plot of the residuals $\eta_i$ versus $\zeta_i$ for the model (3.1)-(3.2) fitted to the raw TFC data. Note that patients with $\zeta_i < -3$ all have $\eta_i > 0$ while those with $\zeta_i > -3$ have, on average, negative $\eta_i$.
Closer examination revealed that patients corresponding to points in the top left of Figure 4(a) had TFC scores $\leq 3$ at baseline and could deteriorate over time only to a score of zero, the lowest possible for TFC. In particular, there were two profoundly affected subjects who entered the study with zero TFC scores and remained at this level throughout (a total of 3.0 and 7.8 years). Such behavior would be expected clinically since Huntington's disease is a chronic progressive illness. These patients correspond to the larger points on the far left of Figure 4(a).
Omission of these two patients from the analysis effectively normalized the residuals from the model of the transformed TFC scores, producing the scatterplot of $\eta_i$ versus $\zeta_i$ shown in Figure 4(b). On the transformed scale, the reduced data set yielded estimates for $\hat\rho_{\zeta\eta}$ and $\hat\beta_{\gamma Y_0}$ of −0.30, 95% CI (−0.60, 0.08), and 0.003, 95% CI (−0.017, 0.024), respectively. The estimated correlation between the individual rate of change of transformed TFC values and the expected transformed TFC at time zero is negative and not quite significant at the 5% level. There is no obvious association between this rate of change and the baseline transformed value. Figure 5 shows the individual regression coefficients (for $t > 0$) plotted against the baseline value for (a) raw TFC scores and (b) transformed TFC. Note that the larger points at the bottom of the plots are associated with the three patients who each had only two postbaseline readings and therefore the least accurate slope estimates. There is no obvious relationship between the individual slopes for $t > 0$ and the baseline values for either the raw or transformed TFC scores. The estimates of $\sigma_\varepsilon$ and $\sigma_\delta$ obtained using (3.12) and the related analysis of variance are 1.430 and 0.298 for the raw TFC and 0.550 and 0.121 for the transformed TFC, respectively. These values are virtually equivalent to the earlier REML estimates for the system (3.1)-(3.2) ($t \geq 0$) given in Table 1.

DISCUSSION
The substantive conclusion is that for TMS there is no firm evidence of a relation between initial value and slope. For TFC, a scale with a very limited range, it is desirable both to transform the response scale to reduce the effects of end-point constraint, and also to exclude the two patients whose initial (and all subsequent) scores were zero. For the remaining patients, although there is a suggestion of a negative relationship between the estimated individual rate of change of transformed TFC values and the expected transformed TFC at time zero, the evidence for this conclusion is not decisive. Analysis of a larger set of data is desirable. It seems unlikely that, even if confirmed, the relation is strong enough for predictive purposes unless the slope for an individual patient could be shown to depend appreciably on an explanatory variable, thereby eliminating a substantial source of random variation. A final point is that in any protocol for a new study in which TFC is an important outcome variable it might be advisable either to exclude patients with a very low initial score or at least to make provision for them to be considered separately.
The analysis illustrates a number of methodological issues. Quite apart from the important need to escape entrapment by regression to the mean, there are two broad distinctions of formulation. One is that between regression on the initial underlying population mean versus regression on the initial observed value. The other is the distinction between treating the initial value as part of the underlying random system and treating the initial value as distinct, for example, related to the reasons why a patient enters the study. For these data any association with observed value at entry is so weak as to be useless for predicting slope. There is, however, some evidence of a modest relationship between the underlying initial mean and slope for TFC. In future studies, it may be worthwhile trying to measure the initial value more precisely.
The various estimates reported in Table 1 are readily available by fitting a carefully formulated multilevel model using likelihood-based methods. This approach on its own is likely to be rather dangerous, however, and may overlook important aspects of the data. Our more elementary analyses are based on fitting separate regressions to each patient, examining the results graphically and constructing an analysis of variance table. This table is closely connected to the analysis of covariance table common in the older literature. The estimate of the component of variance between regression coefficients obtained via the analysis of variance table equating mean squares to their expectation is very similar to that obtained by the slightly more efficient REML procedure.
The details of the statistical analysis illustrate a number of general points. First is the need to avoid totally a regression to the mean effect. Then there is the more subtle aspect of distinguishing between regression of slope on the observed value of the initial response and regression on some underlying notional true value. The former permits the possibility that the initial observation is predictive of slope even though it is not part of the main stochastic system. The suggested approach to analysis also highlights other potential problems such as apparently anomalous individuals. These latter issues are likely to arise quite often when multilevel random variability is present.