Abstract

The use of reliable change (RC) statistics to determine whether an individual has significantly improved or deteriorated on retesting is growing rapidly in clinical neuropsychology. This paper demonstrates how, with only basic test–retest data and a series of simple expressions, the clinician or researcher can implement most contemporary RC models. Although they share a fundamental structure, RC models vary in how they derive predicted retest scores and standard error terms. Published test–retest normative data and a simple case study are presented to demonstrate how to calculate several well-known RC scores. The paper highlights the circumstances under which models will diverge in their estimation of RC. Most importantly, variation in an individual's performance relative to controls at initial testing, practice effects, inequality of control variability from test to retest, and degree of reliability will produce systematic and predictable disagreement among models. More generally, the limitations and opportunities of RC methodology are discussed. Although a consensus on the preferred model continues to be debated, comparison of RC models in clinical samples is encouraged.

Introduction

There are many instances where neuropsychologists may wish to monitor change in test scores for an individual over time, for example, following an insult (sports-related concussion) or intervention (temporal lobectomy for intractable epilepsy, coronary artery bypass grafting). In this role, the use of reliable change (RC) indices in various disciplines of psychology has become increasingly popular. RC methods fundamentally seek to evaluate whether a statistically significant difference in test scores has occurred for an individual that cannot be accounted for by systematic error (e.g., practice effects) or measurement error (e.g., test unreliability). There are several variations of RC reported in the clinical psychology and clinical neuropsychology literature. This report seeks to demonstrate that most forms of RC used in neuropsychology can be derived from basic test–retest descriptive data, and to further examine the similarities and differences between models of RC.

It is well recognized that many neuropsychological tests, particularly performance measures, are susceptible to practice effects (McCaffrey & Westervelt, 1995). Thus, test–retest normative data are becoming more available, often with RC norms provided (Strauss, Sherman, & Spreen, 2006). In the field of neuropsychology, two basic models that attempt to correct for practice appear to be most popular: the mean practice effect model described by Chelune, Naugle, Luders, Sedlak, and Awad (1993) and the linear regression-based model of McSweeny, Naugle, Chelune, and Luders (1993). The latter model has been said to account for practice effects as well as regression to the mean on retesting. When RC norms are not available, or the clinician's or researcher's preferred method is not presented, or the exact formula used is not clearly stated, various RC expressions can be derived from basic retest descriptive data. A recent example of retest norms from the literature (Woods et al., 2006) will be used to demonstrate how to derive the elements of various RC expressions. Moreover, illustrative case examples are used to highlight similarities and differences among the RC models.

A Fundamental Structure

The RC index attributed to Jacobson and Truax (1991) is a statistical method for determining whether a person's outcome score has significantly changed on re-assessment. There have been many variations proposed on how to calculate RC, yet all can be reduced to a simple fundamental expression which yields a standard (Z) score, where Y is the actual retest score, Y′ the predicted retest score, and SE the standard error:

$$ \mathrm{RC} = \frac{Y - Y'}{SE} $$

An RC score exceeding ±1.645 is often considered significant, corresponding to a 90% level of confidence. Naturally, more stringent or relaxed significance or confidence levels may be used, with some authors even reporting 70% (with 80% and 90%) confidence intervals (e.g., Hermann et al., 1996). The varied approaches differ only in how they determine each of the values above. For example, in the well-known expression of Jacobson and Truax (1991), Y is the retest score, Y′ the initial test score, and SE the standard error of measurement for difference scores. For a review of many of the RC variations, the interested reader is directed to Maassen (2001, 2004) and Temkin, Heaton, Grant, and Dikmen (1999). However, it is possibly not well known that most RC models can be derived from the basic descriptive statistics of the test–retest normative group, including estimates of a practice effect, reliability, and variability. Moreover, one can readily convert one model to another. This paper seeks to elucidate the mathematical expressions needed to convert test–retest data into RC expressions. To do this, the methods by which Y′ and SE are calculated will be explained separately. The reason for separating the discussion of predicted retest scores and error terms is that different RC models use varied combinations of these elements. Moreover, this paper seeks to highlight differences in the results obtained by the methods presented. In this way, it is intended that the reader become more aware not only of the calculations of RC, but also of the implications of the method chosen.
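
For readers who prefer to script the calculation, the fundamental structure can be expressed in a few lines of Python. This is a minimal sketch; the function names are illustrative and do not come from any published implementation.

```python
# A minimal sketch of the shared RC structure: every model supplies its own
# predicted retest score (y_pred) and standard error (se).
def reliable_change(y_retest, y_pred, se):
    """Return the RC statistic as a standard (Z) score."""
    return (y_retest - y_pred) / se

def exceeds_criterion(rc, critical=1.645):
    """True if the RC score exceeds the 90% confidence criterion."""
    return abs(rc) > critical
```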

Deriving the Predicted Retest Score (Y′)

Under the original Jacobson and Truax (1991) approach, the predicted score was equal to the initial test score for the individual. The authors were examining outcome from psychotherapy and were not concerned with learning effects. However, as stated, practice effects are frequently encountered in neuropsychological testing. To correct for this, Chelune and colleagues (1993) estimated the retest score by adjusting the initial test score with the mean practice effect seen in a control group. In this way, the person had to significantly exceed the practice effect to be considered changed. Under the Chelune model, the predicted retest score is simply the initial test score plus the mean practice effect seen in a control group (see Y′[1], Appendix A). Hence, the Chelune predicted score is a uniform correction, unlike the methods to follow.

In the same year, McSweeny and colleagues (1993) introduced a standard regression-based model. The McSweeny model corrected for a mean practice effect, but also adjusted for regression to the mean on retesting. Under a regression model, a more extreme score on initial testing will undergo a greater adjustment toward the mean on retesting. This effectively varies the change score required to reach significance, such that low scorers must obtain a greater raw change and high scorers a lesser raw change to be considered significantly changed. The magnitude of the adjustment is also determined by the mean practice effect, measurement error, and inequality of variances. Under the simple McSweeny model, the expected retest score is calculated through simple regression of retest scores on initial test scores taken from a control group (see Y′[3], Appendix A). In fact, the Chelune predicted score can be seen as a special form of the McSweeny predicted score when the slope of the least-squares regression line (b) = 1 and the constant (a) is the mean practice effect. This will occur when initial and retest variability are equal, or when regression to the mean due to measurement error is compensated by divergence from the mean due to increasing retest variance (see Maassen, Bossema, & Brand, 2006). As test–retest reliability approaches unity (rXY = 1), these two methods will yield converging predicted scores. The McSweeny predicted score also adjusts for regression to the mean due to measurement error; the degree of correction is inversely proportional to test reliability. Further correction is made in the presence of inequality of test and retest variability. When initial test variability exceeds retest variability (SX > SY), a greater correction toward the mean is made. When retest variability is larger than initial variability (SX < SY), this correction will serve to move the predicted score away from the control retest mean. In sum, the McSweeny predicted score is affected by the relative position of the individual's initial score, test reliability, and variance inequality.

A variation on correcting for regression to the mean in the predicted score, without a practice adjustment, was described by Speer (1992). Speer, and later Charter (1996), proposed adjusting the actual score toward a classic "true" score. This was achieved by taking the control mean as the starting estimate and adding the individual's initial deviation score weighted by the reliability coefficient (see Y′[2], Appendix A). Where measures of internal consistency are available, these have been entered as r in the Speer predicted score; however, as these are often not applicable to performance measures, retest reliability has been substituted (Basso, Bornstein, & Lang, 1999). This adjustment does not account for practice effects, however, and when applied to neuropsychological data it was not clear how practice effects should be considered when attempting to identify true change. Substituting the control retest mean (MY) for the generic control mean essentially adjusts for the mean practice effect. The Speer predicted score is affected by the performance of the individual relative to controls and by test reliability, but not by inequality of variances, as it does not account for retest variability. As expected, the Speer and the McSweeny predicted scores will be identical when control group initial and retest variances are equal. The Speer predicted score adjusts for regression to the mean due to unreliability of the test, whereas the McSweeny score makes further adjustment for inequality of variances.

Maassen and colleagues (2006) have argued that the simple regression approach provides a biased estimate of the true retest score. These authors have suggested that the slope of the line (b) coefficient be adjusted by dividing by the true initial test reliability (ρXX); badj = b/ρXX. Maassen and colleagues suggested that using the test–retest rXY coefficient in the place of ρXX is appropriate in most settings, such that badj = b/rXY. Algebraically, and fortunately, badj can actually be reduced to SY/SX or the retest standard deviation divided by the initial test standard deviation (see Y′[4] Appendix A). The adjusted slope of the line is then used to derive an adjusted constant, and hence an expression for the predicted score. The Maassen calculation of the predicted score is thus not directly affected by test reliability; however, it is affected by inequality of initial and retest variability. When control group variances are equal, the Maassen predicted score will equal the Chelune predicted score. The degree of correction made depends on the magnitude of the inequality of variances and relative position of the individual at initial testing.
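
To make the four predicted-score expressions concrete, the following sketch implements Y′[1] through Y′[4] as described above and in Appendix A, using the BVMT-R Total values from Table 1 as a check against Table 2. The function names are illustrative only; this is a minimal sketch rather than a published implementation.

```python
# Predicted retest scores (Y') from control test-retest statistics.
# mx, my = control initial and retest means; sx, sy = standard deviations;
# r = test-retest correlation; x = the individual's initial test score.

def y_chelune(x, mx, my):              # Y'[1]: uniform mean practice adjustment
    return x + (my - mx)

def y_speer(x, mx, my, r):             # Y'[2]: reliability-weighted deviation from the mean
    return my + r * (x - mx)

def y_mcsweeny(x, mx, my, sx, sy, r):  # Y'[3]: simple regression of retest on initial score
    b = r * (sy / sx)                  # least-squares slope
    a = my - b * mx                    # least-squares constant
    return b * x + a

def y_maassen(x, mx, my, sx, sy):      # Y'[4]: slope adjusted to SY/SX
    b_adj = sy / sx
    a_adj = my - b_adj * mx
    return b_adj * x + a_adj

# Check: BVMT-R Total Trials 1-3 (Table 1), case starting 0.5 SD below the mean.
mx, sx, my, sy, r = 26.04, 5.77, 27.98, 5.17, 0.57
x = 23.16                              # initial score, as listed in Table 2
print(y_chelune(x, mx, my),            # ~25.10
      y_speer(x, mx, my, r),           # ~26.34
      y_mcsweeny(x, mx, my, sx, sy, r),  # ~26.51
      y_maassen(x, mx, my, sx, sy))    # ~25.40
```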

The issue of whether correction for practice under the Chelune and colleagues (1993) model should be made only when the difference between means is significant in the control group has yet to be properly resolved. It seems many authors make the adjustment irrespective of significance. It has been adequately demonstrated that failure to correct for practice will often reduce sensitivity (Temkin et al., 1999). However, an unreliable correction for practice will likely bias the predicted retest score, which underpins the need for large normative samples. The issue of significance is less problematic for the regression models, given the greater comparative power of the prediction: mean practice is automatically accounted for, and further adjustment to the predicted score is made depending on test reliability and the relative position of the initial test score.

It has also been recognized that differences in the initial and retest variance of the controls will affect estimates of standard error (Maassen, 2004). It should be clear that this will also affect predicted scores for regression-based models. Maassen (2004) has described the phenomenon of "differential practice," which essentially refers to an inequality of variance from initial test to retest. When retest variance is less than initial variance, further regression to the mean will be seen in the predicted score, independent of that seen due to test unreliability. If retest variance exceeds initial variance, divergence from the mean will be seen in the predicted score. The presence of differential practice or inequality of variance will affect RC models differently, as will be demonstrated. Maassen (2005) presented the following expression to test whether the difference between variances in a correlated sample is significant, following a t distribution with N − 2 degrees of freedom, where SX is the initial test standard deviation, SY the retest standard deviation, N the control sample size, and rXY the test–retest reliability coefficient:

$$ t = \frac{(S_X^{2} - S_Y^{2})\sqrt{N - 2}}{2\, S_X S_Y \sqrt{1 - r_{XY}^{2}}} $$
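
This test can be scripted directly from the control descriptive statistics. The sketch below is a minimal illustration assuming the form of the statistic as reconstructed above; applied to Trails B from Table 1, it yields a clearly significant result, consistent with the differential practice flagged in that table.

```python
from math import sqrt

def differential_practice_t(sx, sy, r, n):
    """Test for equality of correlated initial/retest variances; the statistic
    follows a t distribution with n - 2 degrees of freedom."""
    return (sx**2 - sy**2) * sqrt(n - 2) / (2 * sx * sy * sqrt(1 - r**2))

# Example: Trails B from Table 1 (SX = 33.59, SY = 18.56, r = .60, N = 57).
t = differential_practice_t(33.59, 18.56, 0.60, 57)   # roughly 5.8
# |t| far exceeds the two-tailed .05 critical value (about 2.0 for 55 df),
# consistent with the significant differential practice flagged in Table 1.
```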

In summary, predicted retest scores can be readily derived from expressions requiring only the basic statistics of a retested comparison or normative group. Moreover, it should be recognized that as reliability approaches unity the estimates will converge. The Chelune predicted score is not affected by any parameter other than the mean practice effect. The adjusted Speer predicted score is affected by the relative position of the individual's initial test performance and by test reliability. The Maassen predicted score is affected by initial test performance and by inequality of variability in the retest control data. Finally, the McSweeny predicted score is affected by initial test performance, test reliability, and inequality of control initial and retest variability. Comparison of the actual predicted scores yielded by the expressions above follows in a subsequent section using a worked case example.

Deriving the Standard Error Score

The “correct” calculation of the error term in non-regression RC models is an ongoing source of debate (Hinton-Bayre, 2004; Maassen, 2004; Temkin, 2004). Jacobson, Follette, and Revenstorf's (1984) original error term was altered by Christensen and Mendoza (1986). The latter authors suggested the use of the standard error of the difference, which in computation is equivalent to the standard deviation of the difference scores observed in the retested control group (see SE[1], Appendix B). In 1991, Jacobson and Truax reported a simpler approach that can be used when no retest data are available. This error term is really just a collapsed version of the Christensen and Mendoza error term under the assumption that test and retest variances are equal (see SE[2], Appendix B). It has been argued that pooled error estimates are preferable whenever retest data are available (Abramson, 2000). Maassen (2005) has argued that the Christensen and Mendoza error term departs from classical test theory and that the formula the latter authors presented actually misrepresents the individual change variability of interest. Maassen provided an SE he argues is more consistent with classical test theory (see McNemar, 1963), which pools initial and retest variance (see SE[3], Appendix B). Maassen also notes that the difference between the two expressions (SE[1] and SE[3]) will increase as test and retest variances differ and reliability decreases, such that the Maassen (2004) standard error will be less than the Christensen and Mendoza (1986) standard error. As expected, the difference between the two error terms has been shown to be negligible for reliable measures in controls (Hinton-Bayre, 2004), but may affect clinical/experimental groups more significantly.

In regression-based models of RC, McSweeny and colleagues (1993) and subsequent authors (e.g., Charter, 1996; Temkin et al., 1999) have implemented the standard error of estimate (SEE), or standard deviation of residuals, as the error term for simple regression-based RC (see SE[4], Appendix B). This value is often provided in linear regression analysis modules. It should also be noted that this term is often referred to as the standard error of prediction. Despite this, traditional prediction of an outcome score (in this case, the "retest" score) for a new individual is given by the standard error of prediction for an individual (SEY; Crawford & Howell, 1998; SE[5], Appendix B). This latter standard error varies as a function of control sample size and yields a non-constant or individualized error term that varies according to the extremity of the initial test score, such that more extreme scores are given a greater margin of error (Crawford & Garthwaite, 2006); it will thus always be somewhat larger than the McSweeny error term. This individualized error term has been implemented in RC by Salinsky, Storzbach, Dodrill, and Binder (2001) and Sherman and colleagues (2003), who reported the average SEY across individuals. Hinton-Bayre (2004) has demonstrated that the McSweeny error term will often be less than the Maassen (2004) error term and will always be less than the Christensen and Mendoza (1986) error term. Maassen and colleagues (2006) went on to prove that the Maassen error term will be less than the McSweeny error term when there is sufficient divergence from the mean (SX < SY) relative to the test reliability. The preferred error term has traditionally been dependent on the predicted value used. More recently, Maassen and colleagues (2006) have suggested that the pooled estimate (SE[3]) is generally preferable. However, it is currently not clear how one should proceed when there is evidence of differential practice or inequality of variances. Moreover, there has been little consideration of whether differential practice needs to be accounted for when there is no evidence of a mean practice effect. It should be evident, however, that standard error terms will differ across models depending on the degree of test unreliability and the inequality of initial and retest variance, also referred to as differential practice (Maassen, 2004).
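
The five error terms can likewise be computed from the same descriptive statistics. The sketch below assumes the standard forms described above and in Appendix B (these reproduce the values reported in Table 3 for the Woods and colleagues data); the function names are illustrative.

```python
from math import sqrt

# Standard error terms used by the RC models (as described above and in Appendix B).
# sx, sy = control initial/retest SDs; r = test-retest correlation;
# n = control sample size; x = individual's initial score; mx = control initial mean.

def se_christensen_mendoza(sx, sy, r):   # SE[1]: SD of control difference scores
    return sqrt(sx**2 + sy**2 - 2 * r * sx * sy)

def se_jacobson_truax(sx, r):            # SE[2]: collapsed form assuming SX = SY
    return sx * sqrt(2 * (1 - r))

def se_maassen(sx, sy, r):               # SE[3]: pools initial and retest variance
    return sqrt((sx**2 + sy**2) * (1 - r))

def se_estimate(sy, r):                  # SE[4]: regression standard error of estimate
    return sy * sqrt(1 - r**2)

def se_prediction(sy, r, n, x, mx, sx):  # SE[5]: individualized prediction error
    return se_estimate(sy, r) * sqrt(1 + 1 / n + (x - mx)**2 / (sx**2 * (n - 1)))

# Check: BVMT-R Total Trials 1-3 (SX = 5.77, SY = 5.17, r = .57, N = 57, X = 23.16)
# gives approximately 5.10, 5.35, 5.08, 4.25, and 4.29, matching Table 3.
```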

Deriving RC Models

When considering several well-known methodological RC papers, it can be seen that various methods of determining the predicted value and standard error have been combined (see Appendix C). Among the models more frequently used in neuropsychology research, the Chelune model provides a uniform adjustment for practice and a uniform standard error for all cases. The McSweeny model modifies the practice effect for each individual, yet uses a constant error term. The Crawford and Howell regression model uniquely adjusts the predicted score and standard error for each case separately, according to the extremity of values on predictor variables. Although the preferred method or methods may still be under debate, any of the above methods can be applied with access to basic test–retest descriptive statistics. A worked case example follows to demonstrate calculation and contrast outcomes for various models.
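
Bringing the two components together, the following sketch assembles the RC models compared in this paper. The pairings of predicted score and error term are those summarized in Appendix C as reflected in the worked example that follows (e.g., the Chelune model pairs Y′[1] with SE[2], and the McSweeny model pairs Y′[3] with SE[4]); the function itself is an illustrative sketch, not a published implementation.

```python
from math import sqrt

def rc_scores(x, y, mx, my, sx, sy, r, n):
    """Return RC scores for the models compared in Tables 4 and 5, given the
    case's initial (x) and retest (y) scores and the control statistics."""
    # Predicted retest scores
    y1 = x + (my - mx)                          # Chelune (Y'[1])
    y2 = my + r * (x - mx)                      # Speer (Y'[2])
    b = r * sy / sx
    y3 = b * x + (my - b * mx)                  # McSweeny (Y'[3])
    b_adj = sy / sx
    y4 = b_adj * x + (my - b_adj * mx)          # Maassen (Y'[4])
    # Standard error terms
    se1 = sqrt(sx**2 + sy**2 - 2 * r * sx * sy)            # Christensen & Mendoza
    se2 = sx * sqrt(2 * (1 - r))                           # Jacobson & Truax
    se3 = sqrt((sx**2 + sy**2) * (1 - r))                  # Maassen (pooled)
    se4 = sy * sqrt(1 - r**2)                              # standard error of estimate
    se5 = se4 * sqrt(1 + 1 / n + (x - mx)**2 / (sx**2 * (n - 1)))  # Crawford & Howell
    return {
        "J&T": (y - x) / se2,
        "Chelune": (y - y1) / se2,
        "Temkin": (y - y1) / se1,
        "Iverson": (y - y1) / se3,
        "Maassen": (y - y4) / se3,
        "Charter": (y - y2) / se4,
        "McSweeny": (y - y3) / se4,
        "Crawford": (y - y3) / se5,
    }

# Example: BVMT-R Total Trials 1-3 for case 1 (X = 23.16, Y = 20.27) reproduces
# the first row of Table 4 (approximately -0.54, -0.90, -0.95, -0.95, -1.01,
# -1.43, -1.47, and -1.45, respectively).
print(rc_scores(x=23.16, y=20.27, mx=26.04, my=27.98, sx=5.77, sy=5.17, r=0.57, n=57))
```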

A Worked Example and Case Study

An example from the literature is used to demonstrate how RC statistics can be derived from published retest data. Woods and colleagues (2006) retested 57 healthy control subjects 1 year apart on a series of commonly used neuropsychological tests. The necessary test–retest statistics are reproduced in Table 1. This particular set of data was chosen because of the variability in test–retest reliability, retest standard deviations, and practice effects, which are all relevant to RC estimates as described above. It was observed that half of the tests examined demonstrated a significant mean practice effect even after 12 months. The reader should note that for the Grooved Pegboard (dominant and non-dominant), Trails (A and B), and Wisconsin Card Sorting Test (WCST)-64 perseverative errors, lower (faster) scores indicate improved performance. Only three tests had significant differential practice, with Hopkins Verbal Learning Test—Revised (HVLT-R) Delayed Recall and Wechsler Adult Intelligence Scale (WAIS)-III Digit Symbol demonstrating divergence from the mean (SX < SY), and Trails B demonstrating regression to the mean (SX > SY). It was observed that mean practice and differential practice occurred independently.

Table 1.

Retest statistics for 57 healthy controls tested 1-year apart

Test MX SX MY SY rXY MY − MX SX/SY 
BVMT-R Total Trials 1–3 26.04 5.77 27.98 5.17 0.57 1.94a 1.12 
BVMT-R Delayed Recall 10.02 1.83 10.3 1.48 0.52 0.28 1.24 
COWAT-FAS 40.56 9.61 42.43 10.04 0.84 1.87a 0.96 
Grooved Pegboard (Dominant) 62.09 6.71 61.05 7.88 0.56 −1.04 0.85 
Grooved Pegboard (Non-Dominant) 68.19 9.05 67.63 10.3 0.71 −0.56 0.88 
HVLT-R Total Trials 1–3 27.51 3.88 27.86 4.11 0.51 0.35 0.94 
HVLT-R Delayed Recall 10.04 1.48 9.93 2.03 0.57 −0.11 0.73b 
PASAT-50 38.28 8.53 42.34 7.79 0.75 4.06a 1.09 
Trails A 23.37 5.93 23.18 7.37 0.35 −0.19 0.80 
Trails B 64.33 33.59 55.56 18.56 0.6 −8.77a 1.81b 
WAIS-III Letter-Number Sequencing 11.07 2.37 11.21 2.51 0.7 0.14 0.94 
WAIS-III Digit Symbol 76.8 14.11 82.3 17.21 0.86 5.5a 0.82b 
WAIS-III Symbol Search 34.09 7.83 36.61 8.86 0.63 2.52a 0.88 
WCST-64 Perseverative Responses 10.42 6.82 7.54 5.5 0.38 −2.88a 1.24 

Notes: MX = initial test mean; MY = retest mean; SX = initial standard deviation; SY = retest standard deviation; rXY = test–retest correlation; BVMT-R = Brief Visuospatial Memory Test—Revised; COWAT = Controlled Oral Word Association Test; HVLT-R = Hopkins Verbal Learning Test—Revised; PASAT-50 = Paced Auditory Serial Addition Test; Trails A & B = Trail Making Test Parts A & B; WAIS-III = Wechsler Adult Intelligence Scale-III; WCST-64 = Wisconsin Card Sorting Test 64 card version. Source: Adapted from Woods and colleagues (2006).

a: Significant mean practice effect (MY − MX) at 12 months, according to Woods and colleagues (2006).

b: Significant differential practice (SX/SY) according to the Maassen (2005) test for inequality of variability (p < .05, two-tailed). A ratio >1 indicates regression to the mean; a ratio <1 indicates divergence from the mean.

In order to demonstrate the calculations fully, a case example was contrived such that the individual performed half of 1 SD below the mean on all measures at initial testing. This was done to highlight differences between RC methods arising from regression to the mean due to measurement error. Moreover, half of 1 SD below the mean represents a level of performance that is essentially in the "average" range at initial testing. Actual retest scores for the case example were set at 1 SD below the initial test control mean. These values were chosen to demonstrate an obvious decline in performance and to highlight the potential effects of practice while further elucidating the differences between RC methods. Note that the example case initial and retest scores are presented in Table 2.
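
As a brief usage sketch, the case scores can be generated directly from the control statistics; the reversal of the offsets for measures on which lower scores indicate better performance is inferred from the values shown in Table 2. The helper function below is illustrative only.

```python
# Construct the case scores from the control statistics: 0.5 SD worse than the
# control mean at initial testing and 1 SD worse at retest (relative to the
# initial control mean). For measures where lower scores indicate better
# performance (e.g., Trails, Grooved Pegboard, WCST errors), the offsets are
# added rather than subtracted, as inferred from the values in Table 2.
def case_scores(mx, sx, lower_is_better=False):
    sign = 1 if lower_is_better else -1
    x = mx + sign * 0.5 * sx
    y = mx + sign * 1.0 * sx
    return x, y

print(case_scores(26.04, 5.77))                         # BVMT-R Total: ~(23.16, 20.27)
print(case_scores(64.33, 33.59, lower_is_better=True))  # Trails B: ~(81.13, 97.92)
```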

Table 2.

Intermediate and predicted retest scores (Y′) for reliable change models

Test b Y′[3] a Y′[3] b Y′[4] a Y′[4] X Y Y′[1] Y′[2] Y′[3] Y′[4] 

(The b and a columns are the intermediate regression slope and constant used to compute Y′[3] and Y′[4]; X and Y are the case's actual initial and retest scores; Y′[1]–Y′[4] are the predicted retest scores.)
BVMT-R Total Trials 1–3 0.51 14.68 0.90 4.65 23.16 20.27 25.10 26.34 26.51 25.40 
BVMT-R Delayed Recall 0.42 6.09 0.81 2.20 9.11 8.19 9.39 9.82 9.92 9.56 
COWAT-FAS 0.88 6.84 1.04 0.06 35.76 30.95 37.63 38.39 38.21 37.41 
Grooved Pegboard (dominant) 0.66 20.22 1.17 −11.87 65.45 68.80 64.41 62.93 63.26 64.99 
Grooved Pegboard (non-dominant) 0.81 12.53 1.14 −9.98 72.72 77.24 72.16 70.84 71.29 72.78 
HVLT-R Total Trials 1–3 0.54 13.00 1.06 −1.28 25.57 23.63 25.92 26.87 26.81 25.81 
HVLT-R Delayed Recall 0.78 2.08 1.37 −3.84 9.30 8.56 9.19 9.51 9.35 8.92 
PASAT-50 0.68 16.12 0.91 7.38 34.02 29.75 38.08 39.14 39.42 38.45 
Trails A 0.43 13.01 1.24 −5.87 26.34 29.30 26.15 24.22 24.47 26.87 
Trails B 0.33 34.23 0.55 20.01 81.13 97.92 72.36 65.64 61.13 64.84 
WAIS-III Letter-Number Sequencing 0.74 3.00 1.06 −0.51 9.89 8.70 10.03 10.38 10.33 9.96 
WAIS-III Digit Symbol 1.05 1.74 1.22 −11.37 69.75 62.69 75.25 76.23 74.90 73.70 
WAIS-III Symbol Search 0.71 12.31 1.13 −1.96 30.18 26.26 32.70 34.14 33.82 32.18 
WCST-64 Perseverative Responses 0.31 4.35 0.81 −0.86 13.83 17.24 10.95 8.84 8.59 10.29 

Notes: BVMT-R = Brief Visuospatial Memory Test—Revised; COWAT = Controlled Oral Word Association Test; HVLT-R = Hopkins Verbal Learning Test—Revised; PASAT-50 = Paced Auditory Serial Addition Test; Trails A & B = Trail Making Test Parts A & B; WAIS-III = Wechsler Adult Intelligence Scale-III; WCST-64 = Wisconsin Card Sorting Test 64 card version.

A series of tables is presented to allow the interested reader to follow how the calculations are made; in many papers, the actual formula used has not been clearly specified, leading to confusion and difficulty in replication (Hinton-Bayre, 2000). Table 2 presents the intermediate calculations required to derive predicted scores, with the latter also presented. It can be clearly seen that predicted scores differed considerably. In particular, the regression-based expressions made far greater adjustments to retest scores when the measures were less reliable (e.g., Trails A and WCST-64 perseverative responses). A similar discrepancy in predicted scores was observed when inequality of variance was clearly evident (e.g., Trails B). Maassen and colleagues (2006) have considered that in settings where test–retest coefficients are poor or retest variance is markedly less than initial test variance, the resultant adjustments for regression to the mean will likely lead to a bias in estimating true change. When measures are reliable and differential practice is not present (i.e., test and retest variances are equivalent), predicted values will be close across methods, irrespective of practice; for example, refer to the Controlled Oral Word Association Test, PASAT-50, and WAIS-III Digit Symbol. The structure of each RC model analyzed below can be found in Appendix C. The reader will observe that either the McSweeny or the Speer predicted score provides the more extreme estimate when compared with the individual's initial score. As would be predicted, the McSweeny score will be more extreme when inequality of control variability suggests regression to the mean (SX > SY), and the Speer score more extreme when there is divergence from the mean (SX < SY). The least extreme corrections were made by the Chelune or the Maassen predicted scores. Regression to the mean due to inequality of variance meant the Chelune predicted score was the most conservative, whereas divergence from the mean meant the Maassen predicted score was the most conservative.

Table 3 presents standard error estimates based on the expressions presented in Appendix B. It can be seen that when retest variance is less than initial variance (SX > SY), the regression-based error terms (SEE[4] and SEY[5]) will be narrower than the other error terms. This is mainly because regression-based error is based solely on retest variance (SY). Moreover, SEY[5] will always be slightly wider than SEE[4] for all subjects, with SEY[5] increasing for more extreme individuals at initial testing (Crawford & Howell, 1998). Conversely, the Jacobson and Truax SE[2] will be the largest error term when SX > SY, as the smaller SY is not considered. When divergence from the mean was seen (SX < SY), the Jacobson and Truax SE[2] was usually the smallest error term, as only initial test variability is used. The Christensen and Mendoza SE[1] was the largest error term when SX < SY. Notable exceptions occurred when the adjustment for reliability differentially reduced SE[4] despite a larger SY (e.g., HVLT-R Total 1–3, WAIS-III Letter-Number Sequencing). Naturally, poor reliability will lead to increased values for all error terms. When reliability is imperfect (rXY < 1) and variability estimates are equal, the regression error terms will be the smallest.

Table 3.

Standard error (SE) scores for reliable change models

Test SE[1] SE[2] SE[3] SE[4] SE[5] 
BVMT-R Total Trials 1–3 5.10 5.35 5.08 4.25 4.29 
BVMT-R Delayed Recall 1.65 1.79 1.63 1.26 1.28 
COWAT-FAS 5.57 5.44 5.56 5.45 5.51 
Grooved Pegboard (dominant) 6.92 6.29 6.87 6.53 6.60 
Grooved Pegboard (non-dominant) 7.46 6.89 7.38 7.25 7.33 
HVLT-R Total Trials 1–3 3.96 3.84 3.96 3.54 3.57 
HVLT-R Delayed Recall 1.70 1.37 1.65 1.67 1.69 
PASAT-50 5.81 6.03 5.78 5.15 5.21 
Trails A 7.67 6.76 7.63 6.90 6.98 
Trails B 26.92 30.04 24.27 14.85 15.01 
WAIS-III Letter-Number Sequencing 1.89 1.84 1.89 1.79 1.81 
WAIS-III Digit Symbol 8.81 7.47 8.33 8.78 8.88 
WAIS-III Symbol Search 7.24 6.74 7.19 6.88 6.96 
WCST-64 Perseverative Responses 6.95 7.59 6.90 5.09 5.14 

Notes: BVMT-R = Brief Visuospatial Memory Test—Revised; COWAT = Controlled Oral Word Association Test; HVLT-R = Hopkins Verbal Learning Test—Revised; PASAT-50 = Paced Auditory Serial Addition Test; Trails A & B = Trail Making Test Parts A & B; WAIS-III = Wechsler Adult Intelligence Scale-III; WCST-64 = Wisconsin Card Sorting Test 64 card version.

As mentioned above, the RC methods examined used varied combinations of the predicted retest scores and standard errors already calculated (see Appendix C). Table 4 provides the RC scores on each test for several RC models from clinical psychology and neuropsychology. Overall, the McSweeny and colleagues (1993) and Charter (1996) RC indices tended to produce the most extreme scores. This is because the predicted retest score under regression models is proportional to the distance of the initial score from the initial control mean. That is, if an initial score is below the mean, as was deliberately the case here, the individual must overcome mean practice and regression to the mean due to imperfect test reliability to obtain a similar RC score when compared with the Chelune mean practice adjustment. This will of course be further exaggerated by increasingly poor test–retest reliability and by differential practice leading to further regression to the mean. The reader should also note that, given the McSweeny and Charter models use the same error term, the predicted score and thus the RC score for these two models differ according to the direction of the inequality of variance. When retest variance exceeds initial test variance, the absolute RC score will be more extreme for the Charter model. The McSweeny RC score will be larger when initial test variance exceeds retest variance. The difference observed between these two regression-based models is due to the fact that the McSweeny predicted score is affected by inequality of variance, whereas the Charter predicted score is not. As noted earlier, it can be demonstrated that when initial and retest variances are equal, the two methods will agree precisely. It is important to be aware that Charter and Feldt (2000) no longer endorse the method attributed to the former author. The Charter model was included as it has been reported in the neuropsychological literature (Basso et al., 1999) and helps demonstrate the influence of measurement error and differential practice when compared with the McSweeny model. The Crawford and Howell (1998) RC score will always be less extreme than the corresponding McSweeny RC score, given the former error term is always larger.

Table 4.

Reliable change scores for various models: Individual “below” control mean at initial testing

Test J&T Chelune Temkin Iverson Maassen Charter McSweeny Crawford 
BVMT-R Total Trials 1–3 −0.54 −0.90L −0.95 −0.95 −1.01 −1.43 −1.47M −1.45 
BVMT-R Delayed Recall −0.51 −0.67L −0.72 −0.73 −0.84 −1.29 −1.36M −1.35 
COWAT-FAS −0.88 −1.23 −1.20 −1.20 −1.16L −1.37M −1.33 −1.32 
Grooved Pegboard (dominant) −0.53 −0.70 −0.64 −0.64 −0.55L −0.90M −0.85 −0.84 
Grooved Pegboard (non-dominant) −0.66a −0.74 −0.68 −0.69 −0.60L −0.88M −0.82 −0.81 
HVLT-R Total Trials 1–3 −0.51 −0.60 −0.58 −0.58 −0.55L −0.92M −0.90 −0.89 
HVLT-R Delayed Recall −0.54a −0.46 −0.37 −0.38 −0.22L −0.57M −0.47 −0.47 
PASAT-50 −0.71 −1.38L −1.43 −1.44 −1.51 −1.82* −1.88M −1.86* 
Trails A −0.44a −0.47 −0.41 −0.41 −0.32L −0.74M −0.70 −0.69 
Trails B −0.56 −0.85L −0.95 −1.05 −1.36 −2.17* −2.48M −2.45* 
WAIS-III Letter-Number Sequencing −0.65 −0.72 −0.70 −0.70 −0.66L −0.94M −0.91 −0.90 
WAIS-III Digit Symbol −0.94 −1.68M −1.43 −1.51 −1.32L −1.54 −1.39 −1.38 
WAIS-III Symbol Search −0.58 −0.96 −0.89 −0.89 −0.82L −1.15M −1.10 −1.09 
WCST-64 Perseverative Responses −0.45 −0.83L −0.91 −0.91 −1.01 −1.65* −1.70M −1.68* 

Notes: BVMT-R = Brief Visuospatial Memory Test—Revised; COWAT = Controlled Oral Word Association Test; HVLT-R = Hopkins Verbal Learning Test—Revised; PASAT-50 = Paced Auditory Serial Addition Test; Trails A & B = Trail Making Test Parts A & B; WAIS-III = Wechsler Adult Intelligence Scale-III; WCST-64 = Wisconsin Card Sorting Test 64 card version.

L: Least extreme score across models (excluding RC J&T).

M: Most extreme score across models.

a: RC estimate not the most conservative, as the practice effect approximates zero.

*p < .05 one-tailed.

The Jacobson and Truax (1991) RC model was the most conservative in the current example in most circumstances, as it does not account for practice. Exceptions occurred only when the mean practice effect was essentially zero (e.g., Grooved Pegboard non-dominant, HVLT-R Delayed Recall, and Trails A). Thus, an argument could be made for adjusting for mean practice even in the absence of a significant effect, although with adequate sample sizes this should not be an issue. When examining the other models that account for practice, the Maassen and colleagues (2006) and the Chelune and colleagues (1993) RC models were the most conservative. Again, inequality of variance was a key influence. The Chelune RC was most conservative when there was regression to the mean due to inequality of variances (SX > SY), as its standard error (SE[2]) only accounts for initial control variance. Setting aside the Chelune RC, the next most conservative was the Temkin RC model, followed by the Iverson RC model, which use the difference-score and pooled error estimates SE[1] and SE[3], respectively. The Maassen RC was the most conservative when there was divergence from the mean (SX < SY). When examining differential practice, the most noticeable discrepancy occurred when there was significant regression to the mean (SX > SY), as observed for Trails B. This result was also exaggerated by the relatively poor reliability, which further adjusted the regression-based predicted scores.

Conceivably, variations in RC estimates will lead to differences in clinical classification. Using a cut-score of −1.645 (one-tailed, p < .05), significant impairment was observed under the regression models on tests where there was a significant practice effect and regression to the mean through inequality of variability (SX > SY) (viz., PASAT-50, Trails B, and WCST-64 perseverative responses). The WAIS-III Digit Symbol was significant only for the Chelune model; this was due to the presence of a practice effect and divergence from the mean (SX < SY) with good reliability. Hence, the RC method chosen would lead to a different result, which could in turn influence clinical conclusions, management options, and recommendations for a patient.

The reader can begin to appreciate that no model is uniformly more sensitive or conservative than another under all conditions; rather, this will depend on the parameters involved. To emphasize this point, a second brief case is presented in Table 5. For this case, the individual started 1 SD above the control mean at initial testing and scored 0.5 SD above the initial control mean at retest. This represents a 0.5 SD drop in performance relative to the initial control mean. The absolute magnitude of the drop in performance was the same as in the first case, whereas the first case started below the control mean at initial testing. In the second case, the Maassen and colleagues (2006) and Iverson, Lovell, and Collins (2003) models provided the most extreme RC scores. As these methods employ the same error term, the difference in RC values was due to inequality of variance, which alters the Maassen predicted score. When there was regression to the mean (SX > SY), the Iverson model was most extreme, and divergence from the mean (SX < SY) rendered the Maassen model more extreme. The Crawford and Howell (1998) model was least extreme when differential practice suggested regression to the mean (SX > SY). The Charter (1996) model was least extreme when there was divergence from the mean (SX < SY). Despite the same absolute magnitude of change in both cases 1 and 2, very few significant changes were seen in case 2. In fact, the only significant RC scores were seen on WAIS-III Digit Symbol under the Chelune and colleagues (1993) and Maassen and colleagues (2006) models. This indicates that the position of the individual relative to controls at initial testing may have a considerable impact on the subsequent RC score depending on the method employed. It is notable that case 2 started 1 SD from the mean, whereas case 1 started one-half of a standard deviation from the mean; thus the former is more affected by adjustment for measurement error under regression models. Nonetheless, the RC model chosen would again influence the interpretation of whether performance had changed.

Table 5.

Reliable change scores for various models: Individual “above” control mean at initial testing

Test J&T Chelune Temkin Iverson Maassen Charter McSweeny Crawford 
BVMT-R Total Trials 1–3 −0.54 −0.90 −0.95 −0.95M −0.83 −0.55 −0.47 −0.46L 
BVMT-R Delayed Recall −0.51 −0.67 −0.72 −0.73M −0.52 −0.25 −0.11 −0.10L 
COWAT-FAS −0.88 −1.23 −1.20 −1.20 −1.28M −0.94L −1.01 −0.99 
Grooved Pegboard (dominant) −0.53 −0.70 −0.64 −0.64 −0.81M −0.22L −0.32 −0.32 
Grooved Pegboard (non-dominant) −0.66 −0.74 −0.68 −0.69 −0.86M −0.34L −0.46 −0.45 
HVLT-R Total Trials 1–3 −0.51 −0.60 −0.58 −0.58 −0.64M −0.11L −0.14 −0.14 
HVLT-R Delayed Recall −0.54 −0.46 −0.37 −0.38 −0.72M 0.00L −0.18 −0.18 
PASAT-50 −0.71 −1.38 −1.43 −1.44M −1.31 −1.20 −1.09 −1.08L 
Trails A −0.44 −0.47 −0.41 −0.41 −0.60M 0.10L 0.03 0.03 
Trails B −0.56 −0.85 −0.95 −1.05M −0.43 −0.82 −0.21 −0.21L 
WAIS-III Letter-Number Sequencing −0.65 −0.72 −0.70 −0.70 −0.77M −0.34L −0.40 −0.39 
WAIS-III Digit Symbol −0.94 −1.68* −1.43 −1.51 −1.88M −1.20L −1.51 −1.48 
WAIS-III Symbol Search −0.58 −0.96 −0.89 −0.89 −1.04M −0.51L −0.61 −0.60 
WCST-64 Perseverative Responses −0.45 −0.83 −0.91 −0.91M −0.72 −0.41 −0.31 −0.30L 

Notes: BVMT-R = Brief Visuospatial Memory Test—Revised; COWAT = Controlled Oral Word Association Test; HVLT-R = Hopkins Verbal Learning Test—Revised; PASAT-50 = Paced Auditory Serial Addition Test; Trails A & B = Trail Making Test Parts A & B; WAIS-III = Wechsler Adult Intelligence Scale-III; WCST-64 = Wisconsin Card Sorting Test 64 card version; RC = reliable change.

L: Least extreme score across models (excluding RC J&T).

M: Most extreme score across models.

*p < .05 one-tailed.

It is important to note that the calculations demonstrated are only possible with a simple linear regression model that has only the initial test score as a predictor. If a multiple regression model is desired when b weights are not available, direct conversion is not readily possible without a correlation matrix of predictors and criterion. However, it has been demonstrated that in reliable measures, extra predictors add negligibly to prediction accuracy (e.g., McSweeny et al., 1993; Temkin et al., 1999).

Discussion

This paper demonstrated that popular RC models share a fundamental structure. The individual's predicted retest score is subtracted from their actual retest score and then divided by a standard error. This yields a standardized score which may be interpreted via a standard Z distribution, or a t distribution if the control sample is small (e.g., n < 50). It was also demonstrated that possession of basic test–retest statistics will allow the interested clinician or researcher to readily derive most of the popular RC models, or to interchange models. In particular, all that is required to generate the predicted retest score and standard error are the test–retest means, standard deviations, and a reliability coefficient from a control group. Moreover, it should now be obvious from the main text that agreement between RC models will depend on the practice effect, reliability, variance inequality, and the individual case's relative position to the control group at initial testing. The formulas presented in Appendices A and B were algebraically manipulated to reflect the basic parameters required, and as such may not directly match the formulas presented in the cited papers, but they are mathematically equivalent. It has already been aptly demonstrated that failure to correct for significant practice will bias RC estimates by underestimating the degree of impairment (Heaton, Temkin, & Dikmen, 2001), as was also observed in the present study. Lesser reliability will result in greater correction toward the control retest mean, the direction of which will depend on whether the case in question falls above or below the control mean at initial testing. Lesser reliability will of course increase error for all methods. When control initial test and retest variances are equal, agreement between methods will be at a maximum. Inequality of variance affects the predicted retest score in regression-based RC models, with the magnitude depending on initial test performance relative to controls. Regression models provided more extreme estimates when control variability reduced on retest. This occurred because the case example started below the control group at initial testing. It is important to note that this effect was reversed when the case performed better than the control mean at initial testing, as reflected in Table 5. When the case started above the control mean, the regression methods provided the most conservative estimates.

An important implication of this study is that there is apparently no universally more sensitive or conservative RC model. The classification bias will vary depending on the individual case and the nature of the control retest data parameters. This result appears to be contrary to a recent suggestion by Maassen, Bossema, and Brand (2009). These authors recently compared the McSweeny, Iverson, and Maassen RC models and suggested that the McSweeny model would be more appropriate when sensitivity is preferred, and the Maassen model when a more conservative approach is desired. The current study, however, suggests that any attempt to select a method to preferentially satisfy sensitivity or specificity concerns will potentially mean that a different model is used for different tests, or even different individuals. This seems hardly defensible from a theoretical perspective. Moreover, it also seems untenable to separately select a predicted retest score and standard error to maximize sensitivity or specificity; the two estimates should at least be theoretically compatible. Until a clearer consensus on RC methodology is reached, using reliable measures, with comparable initial and retest variances, and a well-matched control sample should minimize concerns regarding the choice of model. In fact, some consideration should be given to whether an RC analysis of any type should be conducted when the conditions above are not met. In the Woods and colleagues (2006) paper, 9 of the 14 tests failed to reach a test–retest reliability of 0.70. It could be argued that interpretation of change in such settings cannot be justified. Statistical adjustments do not really correct methodological issues.

It has been suggested that classification discrepancies can be minimized by using measures with reliability estimates >0.70 and ensuring the clinical case falls within 1 SD of the control group at initial testing (Hinton-Bayre, 2005). It is currently not clear what action should be taken when there is significant differential practice. It is clear, however, that such a discrepancy can greatly affect consistency across models. In the present study, the largest inequality of variance occurred for Trails B, where the variance ratio (SX²/SY²) was 3.28:1, which saw RC scores differ by 1.63. From a theoretical perspective, the present author would prefer a regression error term when there is differential practice. In the presence of significant differential practice (SX ≠ SY), Maassen (2005) had initially suggested that the McSweeny RC model would be appropriate. It is still not clear how to control for such effects and promote consistency of variability over time. A systematic evaluation of the effects of salient factors on RC scores seems pertinent.

From a pragmatic point of view, it is important that the individual match the control sample well. This includes not only performance at initial testing, but also the retest interval. The use of increasingly available retest norms must bear this caveat. Variations in retest interval will likely lead to differences in practice effect, test–retest reliability, and possibly even differential practice, all of which have been demonstrated to have a significant bearing on RC estimates. It is important to note that retest norms are subject to the same limitations as regular norms, and clinicians are responsible for selecting both so as to match the individual and setting as closely as possible. As was seen, failure to do so could result in potentially biased predicted scores when individuals fall at the extreme ends at baseline testing. This exemplifies the well-known problem of the increased error inherent in making individual versus group predictions. Most importantly, it is incumbent on researchers and clinicians to strive to use their assessment tools in the most reliable manner possible. Using measures in a reliable and valid manner is perhaps the single best way to ensure a dependable result irrespective of the formula used (Hinton-Bayre, 2005).

In the current study, the case examples were chosen to show an absolute decline in test performance. This decline was set uniformly as a drop equivalent to 0.5 SD for each test. Moreover, the initial scores, and hence the retest scores, were also uniform across tests. In reality, the relative position at initial testing will vary, and the degree of change will vary depending on the nature of the insult or intervention. Such variation will of course lead to variations in the equivalence of models, as alluded to earlier. It is also worth noting that the difference between the control and clinical group (viz., effect size) will influence the comparability of RC model results. When the effect size is small, indicating a large degree of distribution overlap, the classification rates will be more similar across models. The same would be seen in the setting of a large effect size, where there is very little overlap of distributions and classifications would again likely be quite consistent across models. Indeed, the nadir of classification consistency across models (not accuracy) would likely occur where the mean of the clinical sample falls close to the control cut-score, for example, the 95th percentile. Although RC seeks to evaluate the relative unusualness of a change in an individual, the inherent magnitude of effect, which includes measure sensitivity, will alter the agreement between RC models.

Despite the varied points of view, several studies have shown that discrepancies in classification between different models will be negligible when measures are used reliably in control samples (e.g., Temkin et al., 1999). However, it is not so clear to what extent variations in RC estimates will affect classifications in clinical samples. Moreover, consideration of only control data provides estimates of specificity, not of sensitivity. Maassen and colleagues (2006) have pointed out that as clinical cases are more likely to fall at the extremes of any predicted score confidence interval, small variations in RC cut-offs may result in relatively larger discrepancies across models. This was reportedly not the case according to Heaton and colleagues (2001). These authors concluded that the simpler Chelune model performed similarly to the McSweeny regression models. It should be noted, however, that the reliabilities of tests were quite good and there was limited evidence of differential practice.

It should also be noted that confidence or prediction intervals can be readily determined if preferred to RC scores. In the following expression, Z is the standard score that represents a desired level of confidence (e.g., Z scores of ±1.645 and ±1.96 correspond to 90% and 95% levels of confidence, respectively) and CI is the resulting confidence interval. Crawford and Garthwaite (2006) suggested that a t-score may be preferable to a Z-score for either point or interval estimates when the comparison group is small (e.g., n < 50).

 
$$ \mathrm{CI} = Y' \pm Z \times SE $$

In constructing the interval, an error band is set symmetrically about the predicted retest score, thereby providing a range of likely retest values. The actual retest score can then be compared against this range to determine statistical significance. The merits of point versus interval estimates for determining statistical significance have long been debated. Increasingly, normative and empirical data are presented in confidence interval format. The main technical benefit of confidence intervals is the provision of a range of possible outcome scores, thus reminding us of the inherent unreliability in any stochastic estimate.
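
A minimal sketch of the interval form, assuming the predicted score and standard error are taken from whichever model is preferred:

```python
def confidence_interval(y_pred, se, z=1.645):
    """Return (lower, upper) bounds around the predicted retest score;
    z = 1.645 gives a 90% interval and z = 1.96 gives a 95% interval."""
    return y_pred - z * se, y_pred + z * se

# An actual retest score falling outside this interval would be flagged as
# reliable change at the chosen level of confidence.
```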

Although the focus of this paper has been largely technical, a brief consideration of the wider implications of using individual change indices is warranted. Many studies of incident effect or treatment efficacy rely on group-based analyses, for example, analysis of variance (ANOVA) models. These provide only an estimate of the overall effect. It has previously been considered that even statistically significant associations between risk factor presence and actual disease state do not correspond to good subsequent classification (Boyko & Alderman, 1990). In fact, exceedingly large associations were required to demonstrate adequate diagnostic classification rates. This point has been further illustrated when assessing the weakness of odds ratios in providing useful information about determining actual disease states (Pepe, Janes, Longton, Leisenring, & Newcomb, 2004). In kind, it can be considered that even when significant differences are seen at a group level, for example via ANOVA, this does not necessarily translate into good diagnostic classification of disease state, or more specifically cognitive impairment. A fundamental reason for the non-concordance between group- and individual-based statistical models is the inherently greater error involved in the analysis of the latter. For example, Hinton-Bayre, Geffen, Geffen, McFarland, and Friis (1999) demonstrated a significant decline in cognitive performance following concussion on several measures of information processing speed. Subsequent individual analyses demonstrated that fewer than half of the clinical group showed significant RC scores on any given measure. This again highlights the difference in sensitivity between group and individual analyses. The requirement for optimal reliability in measurement cannot be overstated in order to minimize the inherent error in individual prediction. Remaining cognizant of the diagnostic accuracy of a measure in a given context is imperative to assessing its clinical application. To this end, RC analyses could be regarded as the necessary next step in determining the utility of neuropsychological measurement in any repeated or longitudinal design.

In contrast, where the effect or intervention renders only a small possibly non-significant overall change, RC models provide a means of estimating which subset of cases might have been affected by an event or intervention. In the author's experience, there have been many occasions where interventions/resources have been withheld or not approved for use as there is no significant “evidence” based on group analyses. The author suspects that we have all been in a situation where we have seen a particular benefit/detriment for an individual from an event/intervention that may not fall under the traditional clinical expectation. RC models potentially offer a metric to isolate those people. Thus, RC models for examining individual change can be viewed as an extension of more traditional statistical models examining an “averaged” effect. However, we must be mindful of the inherent lesser degree of sensitivity when making interpretations of outcome and subsequent recommendations.

The present paper has highlighted the shared fundamental structure of published RC indices and demonstrated that basic descriptive test–retest statistics can be manipulated algebraically to yield such indices. Individual RC scores will vary across the models examined depending on practice effects, test reliability, the comparability of the clinical case to controls at initial testing, and the presence of differential practice effects. A systematic evaluation of these parameters seems warranted. It remains unclear whether any particular model is to be preferred, and a concerted effort to arrive at a consensus approach should be a priority. This study suggests that no single model will be more sensitive or more conservative under all conditions. While numerous authors have concluded that there is little difference in classification rates across models in control data, there has been limited systematic consideration of variations in clinical samples. It should also be recognized that RC models of any description may be inappropriate unless certain conditions are met, including adequate test reliability and suitable control data with a matched retest interval. Differential practice appears to have a significant bearing on RC models and likewise warrants further investigation.

Conflict of Interest

None declared.

Appendix A: Predicted scores (Y′) in reliable change methods

Y′[1] Chelune and colleagues (1993)
Predicted score: Y′ = X + (M_Y − M_X)
Components: X = individual's initial test score; M_X = control group initial test mean; M_Y = control group retest mean
Corrects for: relative position, no; reliability, no; inequality of variance, no

Y′[2] Speer (1992) (a)
Predicted score: Y′ = M_Y + r_XY(X − M_X)
Components: as described above, with r_XY = Pearson's test–retest correlation in controls
Corrects for: relative position, yes; reliability, yes; inequality of variance, no

Y′[3] McSweeny and colleagues (1993)
Predicted score: Y′ = bX + a
Components: b = r_XY × (S_Y/S_X), the slope of the least-squares regression of retest (Y) on initial test (X), where S_X and S_Y are the control group initial test and retest standard deviations; a = M_Y − bM_X, the constant of the least-squares regression line
Corrects for: relative position, yes; reliability, yes; inequality of variance, yes

Y′[4] Maassen and colleagues (2006)
Predicted score: Y′ = b_adj X + a_adj
Components: b_adj = S_Y/S_X; a_adj = M_Y − b_adj M_X
Corrects for: relative position, yes; reliability, no (b); inequality of variance, yes

Note: All predicted score expressions above adjust for practice.

(a) Speer (1992) did not originally correct for practice, using M_X rather than M_Y.

(b) The calculated predicted score does not vary with the level of reliability.
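As a worked illustration of the four predicted-score expressions in Appendix A, the following sketch computes Y′[1] through Y′[4] from control test–retest descriptives. It is a minimal sketch in Python; the function and argument names are illustrative rather than taken from the source.

```python
def predicted_scores(x, mx, my, sx, sy, r):
    """Predicted retest scores (Y') from control test-retest descriptives.

    x      : individual's initial test score
    mx, my : control group initial and retest means
    sx, sy : control group initial and retest standard deviations
    r      : Pearson test-retest correlation in controls
    """
    y1 = x + (my - mx)                    # Y'[1] Chelune et al. (1993): mean practice effect
    y2 = my + r * (x - mx)                # Y'[2] Speer (1992), practice-adjusted form
    b = r * (sy / sx)                     # slope of least-squares regression of retest on test
    y3 = b * x + (my - b * mx)            # Y'[3] McSweeny et al. (1993): regression-based
    b_adj = sy / sx                       # adjusted slope, unaffected by reliability
    y4 = b_adj * x + (my - b_adj * mx)    # Y'[4] Maassen et al. (2006)
    return {"Y'[1]": y1, "Y'[2]": y2, "Y'[3]": y3, "Y'[4]": y4}
```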

Appendix B: Standard error (SE) estimates in reliable change methods

SE[1] Christensen and Mendoza (1986)
Expression: SE[1] = √(S_X² + S_Y² − 2·S_X·S_Y·r_XY)
Components: S_X = control group initial test standard deviation; S_Y = control group retest standard deviation; r_XY = Pearson's correlation between initial and retest scores
Comments: Can also be calculated as the standard deviation of the difference scores in the control group; combines initial and retest variability

SE[2] Jacobson and Truax (1991)
Expression: SE[2] = S_X·√(2(1 − r_XY))
Components: as described above
Comments: Uses only initial test variability in the error estimate; equals Christensen and Mendoza when S_X = S_Y

SE[3] Maassen (2004)
Expression: SE[3] = √((S_X² + S_Y²)(1 − r_XY))
Components: as described above
Comments: Equals Christensen and Mendoza when S_X = S_Y; based on classical test theory

SE[4] McSweeny and colleagues (1993)
Expression: SE[4] = S_Y·√(1 − r_XY²)
Components: as described above
Comments: The standard deviation of the residuals from the simple linear regression in controls (standard error of estimate)

SE[5] Crawford and Howell (1998)
Expression: SE[5] = SE[4]·√(1 + 1/n + (X − M_X)²/(S_X²(n − 1)))
Components: SE[4] is the standard error of estimate of McSweeny and colleagues (1993); n = control sample size; otherwise as described above
Comments: The only error term that yields an individualized error estimate; adjusts the error according to relative position at initial testing
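Assuming the standard published forms of each error term summarized above, a companion sketch computes the five SE estimates; variable names follow the table, and n denotes the control sample size. This is a hedged illustration, not a definitive implementation of any single author's code.

```python
import math

def standard_errors(x, mx, sx, sy, r, n):
    """Standard error (SE) terms used in reliable change models (see Appendix B).

    x      : individual's initial test score (used only by SE[5])
    mx, sx : control group initial test mean and standard deviation
    sy     : control group retest standard deviation
    r      : Pearson test-retest correlation; n : control sample size
    """
    se1 = math.sqrt(sx**2 + sy**2 - 2 * sx * sy * r)    # Christensen & Mendoza (1986)
    se2 = sx * math.sqrt(2 * (1 - r))                   # Jacobson & Truax (1991)
    se3 = math.sqrt((sx**2 + sy**2) * (1 - r))          # Maassen (2004)
    se4 = sy * math.sqrt(1 - r**2)                      # McSweeny et al. (1993): SE of estimate
    se5 = se4 * math.sqrt(1 + 1 / n + (x - mx)**2 / (sx**2 * (n - 1)))  # Crawford & Howell (1998)
    return {"SE[1]": se1, "SE[2]": se2, "SE[3]": se3, "SE[4]": se4, "SE[5]": se5}
```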

Appendix C: Reliable change expressions

Jacobson and Truax (1991)
Predicted score: Y′ = initial test score (X); Standard error: SE[2] (Jacobson & Truax, 1991)
Comments: Does not correct for practice; does not incorporate retest variability

Speer (1992)
Predicted score: Y′[2] (Speer, 1992); Standard error: SE[2] (Jacobson & Truax, 1991)
Comments: Corrects for regression to the mean due to measurement error; does not correct for practice; does not incorporate retest variability

Chelune and colleagues (1993)
Predicted score: Y′[1] (Chelune et al., 1993); Standard error: SE[2] (Jacobson & Truax, 1991)
Comments: Uniform correction for practice; does not incorporate retest variability

McSweeny and colleagues (1993)
Predicted score: Y′[3] (McSweeny et al., 1993); Standard error: SE[4] (McSweeny et al., 1993)
Comments: Corrects for practice, measurement error, and differential practice; uniform error term using only retest variability

Charter (1996)
Predicted score: Y′[2] (Speer, 1992); Standard error: SE[4] (McSweeny et al., 1993)
Comments: Corrects for regression to the mean due to unreliability and for practice; uniform error term using only retest variability

Crawford and Howell (1998)
Predicted score: Y′[3] (McSweeny et al., 1993); Standard error: SE[5] (Crawford & Howell, 1998)
Comments: Corrects for practice and regression to the mean in the predicted score; individualizes the error term based on the initial test score

Temkin and colleagues (1999)
Predicted score: Y′[1] (Chelune et al., 1993); Standard error: SE[1] (Christensen & Mendoza, 1986)
Comments: Uniform correction for practice; combines test and retest variability to give a uniform error estimate

Iverson and colleagues (2003)
Predicted score: Y′[1] (Chelune et al., 1993); Standard error: SE[3] (Maassen, 2004)
Comments: Uniform correction for practice; combines test and retest variability for a uniform error estimate

Maassen and colleagues (2006)
Predicted score: Y′[4] (Maassen et al., 2006); Standard error: SE[3] (Maassen, 2004)
Comments: Corrects for practice, regression to the mean, and bias in the predicted score; combines test and retest variability to give a uniform error term
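Appendix C simply pairs a predicted score with an error term, so any of the listed models can be assembled from the two sketches above. The example below, using hypothetical control descriptives and case scores, forms the Temkin and colleagues (1999) variant (Y′[1] with SE[1]).

```python
def reliable_change(y, y_pred, se):
    """Fundamental RC expression: standardized difference of observed vs. predicted retest score."""
    return (y - y_pred) / se

# Hypothetical values for illustration only.
yp = predicted_scores(x=45, mx=50, my=53, sx=10, sy=9, r=0.80)
se = standard_errors(x=45, mx=50, sx=10, sy=9, r=0.80, n=40)
rc = reliable_change(y=44, y_pred=yp["Y'[1]"], se=se["SE[1]"])
# |rc| > 1.645 (90%, two-tailed) or 1.96 (95%, two-tailed) would indicate reliable change.
```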

References

Abramson, I. S. (2000). Reliable change formula query: A statistician's comments. Journal of the International Neuropsychological Society, 6, 365.

Basso, M. R., Bornstein, R. A., & Lang, J. M. (1999). Practice effects on commonly used measures of executive function across twelve months. The Clinical Neuropsychologist, 13, 283–292.

Boyko, E. J., & Alderman, B. W. (1990). The use of risk factors in medical diagnosis: Opportunities and cautions. Journal of Clinical Epidemiology, 43, 851–858.

Charter, R. A. (1996). Revisiting the standard errors of measurement, estimate, and prediction and their application to test scores. Perceptual and Motor Skills, 82, 1139–1144.

Charter, R. A., & Feldt, L. S. (2000). The relationship between two methods of evaluating an examinee's difference scores. Journal of Psychoeducational Assessment, 18, 125–142.

Chelune, G. J., Naugle, R. I., Luders, H., Sedlak, J., & Awad, I. A. (1993). Individual change after epilepsy surgery: Practice effects and base-rate information. Neuropsychology, 7, 41–52.

Christensen, L., & Mendoza, J. L. (1986). A method of assessing change in a single subject: An alteration of the RC index. Behavior Therapy, 12, 305–308.

Crawford, J. R., & Garthwaite, P. H. (2006). Comparing patients' predicted test scores from a regression equation with their obtained scores: A significance test and point estimate of abnormality with accompanying confidence limits. Neuropsychology, 20, 259–271.

Crawford, J. R., & Howell, D. C. (1998). Regression equations in clinical neuropsychology: An evaluation of statistical methods for comparing predicted and obtained scores. Journal of Clinical and Experimental Neuropsychology, 20, 755–762.

Heaton, R. K., Temkin, N. R., & Dikmen, S. S. (2001). Detecting change: A comparison of three neuropsychological methods using normal and clinical samples. Archives of Clinical Neuropsychology, 16, 75–91.

Hermann, B. P., Seidenberg, M., Schoenfeld, J., Peterson, J., Leveroni, C., & Wyler, A. R. (1996). Empirical techniques for determining the reliability, magnitude, and pattern of neuropsychological change after epilepsy surgery. Epilepsia, 37, 942–950.

Hinton-Bayre, A. D. (2000). Reliable change formula query. Journal of the International Neuropsychological Society, 6, 362–363.

Hinton-Bayre, A. D. (2004). Holding out for a reliable change from confusion to a solution: A comment on "The standard error in the Jacobson and Truax Reliable Change Index". Journal of the International Neuropsychological Society, 10, 894–898.

Hinton-Bayre, A. D. (2005). Methodology is more important than statistics when determining reliable change. Journal of the International Neuropsychological Society, 11, 788–789.

Hinton-Bayre, A. D., Geffen, G. M., Geffen, L. B., McFarland, K., & Friis, P. (1999). Concussion in contact sports: Reliable change indices of impairment and recovery. Journal of Clinical and Experimental Neuropsychology, 21, 70–86.

Iverson, G. L., Lovell, M. R., & Collins, M. W. (2003). Interpreting change on ImPACT following sport concussion. The Clinical Neuropsychologist, 17, 460–467.

Jacobson, N. S., Follette, W. C., & Reverstorf, D. (1984). Psychotherapy outcome research: Methods for reporting variability and evaluating clinical significance. Behavior Therapy, 15, 336–352.

Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19.

Maassen, G. H. (2001). Principles of defining reliable change indices. Journal of Clinical and Experimental Neuropsychology, 22, 622–632.

Maassen, G. H. (2004). The standard error in the Jacobson and Truax reliable change index: The classical approach to the assessment of reliable change. Journal of the International Neuropsychological Society, 10, 888–893.

Maassen, G. H. (2005). Reliable change assessment in sport concussion research: A comment on the proposal and reviews of Collie et al. British Journal of Sports Medicine, 39, 483–488.

Maassen, G. H., Bossema, E. R., & Brand, N. (2006). Reliable change assessment with practice effects in sports concussion research: A comment on Hinton-Bayre. British Journal of Sports Medicine, 40, 829–833.

Maassen, G. H., Bossema, E. R., & Brand, N. (2009). Reliable change and practice effects: Outcomes of various indices compared. Journal of Clinical and Experimental Neuropsychology, 31, 339–352.

McCaffrey, R. J., & Westervelt, H. J. (1995). Issues associated with repeated neuropsychological assessments. Neuropsychology Review, 5, 203–221.

McNemar, Q. (1963). Psychological statistics (3rd ed.). New York: Wiley.

McSweeny, A. J., Naugle, R. I., Chelune, G. J., & Luders, H. (1993). "T scores for change": An illustration of a regression approach to depicting change in clinical neuropsychology. The Clinical Neuropsychologist, 7, 300–312.

Pepe, M. S., Janes, H., Longton, G., Leisenring, W., & Newcomb, P. (2004). Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. American Journal of Epidemiology, 159, 882–890.

Salinsky, M. C., Storzbach, D., Dodrill, C. B., & Binder, L. M. (2001). Test–retest bias, reliability, and regression equations for neuropsychological measures repeated over a 12–16 week period. Journal of the International Neuropsychological Society, 7, 597–605.

Sherman, E. M. S., Slick, D. J., Connolly, M. B., Steinbok, P., Martin, R., Strauss, E., et al. (2003). Reexamining the effects of epilepsy surgery on IQ in children: Use of regression-based change scores. Journal of the International Neuropsychological Society, 9, 879–886.

Speer, D. C. (1992). Clinically significant change: Jacobson and Truax (1991) revisited. Journal of Consulting and Clinical Psychology, 60, 402–408.

Strauss, E., Sherman, M. S., & Spreen, O. (2006). A compendium of neuropsychological tests: Administration, norms and commentary (3rd ed.). New York: Oxford University Press.

Temkin, N. R. (2004). Standard error in the Jacobson and Truax reliable change index: The "classical approach" leads to poor estimates. Journal of the International Neuropsychological Society, 10, 899–901.

Temkin, N. R., Heaton, R. K., Grant, I., & Dikmen, S. S. (1999). Detecting significant change in neuropsychological test performance: A comparison of four models. Journal of the International Neuropsychological Society, 5, 357–369.

Woods, S. P., Childers, M., Ellis, R. J., Guaman, S., Grant, I., & Heaton, R. K. (2006). A battery approach for measuring neuropsychological change. Archives of Clinical Neuropsychology, 21, 83–89.