## Abstract

There is an ongoing debate over the preferred method(s) for determining the reliable change (RC) in individual scores over time. In the present paper, specificity comparisons of several classic and contemporary RC models were made using a real data set. This included a more detailed review of a new RC model recently proposed in this journal, that used the within-subjects standard deviation (WSD) as the error term. It was suggested that the RC_{WSD} was more sensitive to change and theoretically superior. The current paper demonstrated that even in the presence of mean practice effects, false-positive rates were comparable across models when reliability was good and initial and retest variances were equivalent. However, when variances differed, discrepancies in classification across models became evident. Notably, the RC using the WSD provided unacceptably high false-positive rates in this setting. It was considered that the WSD was never intended for measuring change in this manner. The WSD actually combines systematic and error variance. The systematic variance comes from measurable between-treatment differences, commonly referred to as practice effect. It was further demonstrated that removal of the systematic variance and appropriate modification of the residual error term for the purpose of testing individual change yielded an error term already published and criticized in the literature. A consensus on the RC approach is needed. To that end, further comparison of models under varied conditions is encouraged.

## Introduction

The use of reliable change (RC) indices for tracking change in individuals’ test scores continues to proliferate within the health sciences. Metrics for assessing the statistical significance of individual change scores have been known for several decades (McNemar, 1962). The classic approach to determining RC is regularly attributed to Jacobson and Truax (1991). Essentially, RC seeks to determine whether true change has occurred over time by standardizing the observed change, that is, dividing it by the standard error of measurement of the difference. Despite years of debate, considerable disagreement remains regarding the most appropriate method for determining statistically significant change in an individual over time. It has been suggested that all RC models can be reduced to the basic formula RC = (*Y* − *Y*′)/SE, where *Y* is the actual post-test score, *Y*′ the predicted post-test score, and SE the standard error (Hinton-Bayre, 2005). A recent article in this journal demonstrated that the majority of popular RC models can be derived through basic test–retest statistics (Hinton-Bayre, 2010). It was suggested that, in the absence of a consensus on the most valid model, researchers or clinicians could subsequently derive RC statistics according to their preferred model. The study also highlighted where models could be seen to diverge in their estimate of RC, which will be presented in more detail shortly. The astute follower of the RC literature may have noted that the most recent model, proposed by Lewis, Maruff, Silbert, Evered, and Scott (2007) in this journal, was omitted from that review. In their article, Lewis and colleagues essentially proposed a new RC model using the within-subjects standard deviation (WSD) as the error term or SE. Moreover, the Hinton-Bayre (2010) paper used only two contrived case studies to highlight the discrepancies between the RC models reviewed.
Thus, the current paper had two objectives. The first objective was to compare RC models using an actual data set, in particular, a group of nonclinical individuals retested on a series of neuropsychological measures. The second objective was to review the RC model proposed by Lewis and colleagues (2007).

## A Brief Chronology of RC Models

The debate over the most appropriate method for determining RC has been ongoing at least since Christensen and Mendoza (1986) modified the SE utilized in the original RC index reported by Jacobson, Follette, and Revenstorf (1984). The latter authors had used the standard error of measurement (SEM) to determine whether the difference between scores over time had changed significantly. Christensen and Mendoza proposed that the standard error of the difference between two test scores should be the appropriate error term, and provided its calculation as SE[1] in Table 1. In their now classic article, Jacobson and Truax (1991) conceded the inappropriateness of the SEM and proposed the use of SE[2] (Table 1) when retest data were not readily available and initial test and retest variances could be assumed to be equal. Since that time, several models of RC have been proposed. The more salient of these include the model of Chelune, Naugle, Lüders, Sedlak, and Awad (1993), which effectively corrected for the mean practice effects seen in repeated neuropsychological assessment. The same group in the same year also proposed a standardized regression-based (SRB) approach for determining RC (McSweeny, Naugle, Chelune, & Lüders, 1993). In the simple SRB model, the retest scores are regressed onto initial scores from a control group. The predicted score (*Y*′) for an individual is calculated via the least-squares regression line, and the SE is simply the standard deviation of the residuals. Many subsequent RC variants have been criticized for leading to unjustified probability statements (Maassen, 2001). Maassen (2004) went on to point out that Christensen and Mendoza had misinterpreted the SE formula required for testing individual difference scores, having provided a formula for the standard deviation of the difference scores.
Although discussed in more detail later, it is worth noting here that this is the same error term used by Temkin, Heaton, Grant, and Dikmen (1999) and Rasmussen and colleagues (2001). In reference to classical test theory, Maassen has strongly and repeatedly argued for the error term to be calculated as seen in SE[3], Table 1 (Maassen, 2004, 2005; Maassen, Bossema, & Brand, 2006, 2009). It should be noted that Iverson, Lovell, and Collins (2003) and Maassen (2003) were the first to utilize an RC model using an adjustment for mean practice effect for the predicted score and the SE[3]. More recent debate has focused on the classic approach with modification for practice effects versus the standard regression approach (e.g., Hinton-Bayre, 2005; Maassen et al., 2006). A small number of studies have reported on both methods and compared classification rates, which often agree closely (e.g., Erlanger et al., 2003). Maassen and colleagues (2006) suggested a modification to the regression-based approach, adjusting the manner in which the predicted score is calculated, which they argued would provide a more accurate representation of true change. These authors continued to recommend SE[3] and went on to compare several models in both theoretical and actual data sets (Maassen et al., 2009). Their eventual conclusion was that their approach was generally preferred because it provided a more conservative classification rate, although they conceded that a standard regression approach might be appropriate when sensitivity is of significant interest. Since then, Hinton-Bayre (2010) demonstrated that most RC models can be derived from basic test–retest statistics. More importantly, it was shown that RC models will give different estimates based on the presence of practice effects, differential practice effects, test–retest reliability, and the individual's relative position compared with controls at initial testing.
Hinton-Bayre also observed that the Maassen and colleagues (2009) approach might not be universally the most conservative, nor the standard regression approach the most sensitive. Ultimately, a theoretically sound, consensus approach to RC is desired. As a contribution to this goal, the present paper seeks to empirically compare RC models in contemporary research literature, including the RC_{WSD}, as proposed by Lewis and colleagues (2007).

| Author | SE expression | Description |
|---|---|---|
| SE[1] (Christensen & Mendoza, 1986) | √(*S*_{X}^{2} + *S*_{Y}^{2} − 2*S*_{X}*S*_{Y}*r*_{XY}) | The standard deviation of difference scores |
| SE[2] (Jacobson & Truax, 1991) | √[2*S*_{X}^{2}(1 − *r*_{XY})] | The SEM of the difference score, when variances are equal |
| SE[3] (Iverson et al., 2003; Maassen, 2004) | √[(*S*_{X}^{2} + *S*_{Y}^{2})(1 − *r*_{XY})] | The SEM of the difference score |
| SE[4] (Lewis et al., 2007) | √[Σ(*X* − *Y*)^{2}/2*n*] | The WSD |
| SE[5] (McSweeny et al., 1993) | *S*_{Y}√(1 − *r*_{XY}^{2}) | The standard deviation of the least-squares regression residuals |

*Notes:* SE = standard error; RC = reliable change; SEM = standard error of measurement; *X* = initial test raw score; *Y* = retest raw score; *S*_{X} = initial test standard deviation in the comparison group; *S*_{Y} = retest standard deviation in the comparison group; *r*_{XY} = test–retest reliability coefficient; *n* = comparison group sample size.
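The error terms summarized in Table 1 can be computed from basic test–retest statistics. The sketch below assumes the commonly published forms of these expressions; the function names are illustrative only, and SE[4] requires raw score pairs rather than summary statistics.

```python
import math

def se1_sd_difference(s_x, s_y, r_xy):
    """SE[1]: standard deviation of the difference scores."""
    return math.sqrt(s_x**2 + s_y**2 - 2 * r_xy * s_x * s_y)

def se2_equal_variances(s_x, r_xy):
    """SE[2]: SEM of the difference, using the initial-test variance only."""
    return math.sqrt(2 * s_x**2 * (1 - r_xy))

def se3_sem_difference(s_x, s_y, r_xy):
    """SE[3]: SEM of the difference, using both variances."""
    return math.sqrt((s_x**2 + s_y**2) * (1 - r_xy))

def se4_wsd(pairs):
    """SE[4]: within-subjects SD from raw (initial, retest) pairs."""
    n = len(pairs)
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / (2 * n))

def se5_regression_residual(s_y, r_xy):
    """SE[5]: standard deviation of the least-squares regression residuals."""
    return s_y * math.sqrt(1 - r_xy**2)

# DSST summary statistics as reported in Table 2.
S_X, S_Y, R_XY = 11.35, 11.62, 0.70
print(round(se1_sd_difference(S_X, S_Y, R_XY), 2))
print(round(se2_equal_variances(S_X, R_XY), 2))
print(round(se3_sem_difference(S_X, S_Y, R_XY), 2))
print(round(se5_regression_residual(S_Y, R_XY), 2))
```

Because the published reliability coefficients are rounded to two decimals, values computed this way should be expected to agree with the tabled error terms only approximately.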

## Foundations of the RC_{WSD}

Lewis and colleagues (2007) compared four RC methods. The standard error formulas used are presented in Table 1. The first two methods were originally described by Jacobson and Truax (1991) and Chelune and colleagues (1993). Notably, the Jacobson and Truax method does not correct for practice and uses only the pretest variance to estimate the standard error, assuming pre- and post-test variances are equivalent. The Chelune and colleagues method corrects for practice as seen in a control group, but uses the Jacobson and Truax error term. Next, the RC_{ISPOCD} was considered (Rasmussen et al., 2001). This method uses the same predicted score *Y*′ as the Chelune method, but “uses the standard deviation of the change scores estimated from the control group used to derive the group mean practice effect” (Lewis et al., 2007, p. 250). As alluded to earlier, this error term can actually be traced back to Christensen and Mendoza (1986); in the absence of raw scores, it can be calculated by an expression using basic test–retest statistics (Table 1). Furthermore, this is one of the error terms described by Temkin and colleagues (1999) in a comparison of various RC methods. Readers should thus be aware that the Rasmussen and colleagues (2001) RC_{ISPOCD} is not yet another method. Finally, the authors introduced the RC_{WSD}, which uses the WSD as the error term. Lewis and colleagues state that the WSD is “drawn from the estimate of residual error in a repeated measures analysis of variance that has compared performance over multiple time-points in a control group” (p. 251). They go on to advise that the WSD can be calculated as described by Bland and Altman (1996). Lewis and colleagues then argued in favor of the RC_{WSD} on the basis that it (i) tends to be more sensitive to impairment post-operatively and (ii) more truly reflects random error in determining RC (Mollica, Maruff, & Vance, 2004).
On the latter point, the authors argued that the other error terms (e.g., RC_{ISPOCD}, Rasmussen et al., 2001) were tainted by the inclusion of systematic error. The present paper elaborates on the theoretical underpinnings and calculation of the WSD as an error term in RC. Ultimately, this paper will demonstrate that the WSD approach is inappropriate for use in RC in neuropsychology.

The RC_{WSD} as proposed by Lewis and colleagues (2007) can also be reduced to the basic formula RC = (*Y* − *Y*′)/SE. The predicted post-test score (*Y*′) Lewis and colleagues adopted is simply that used earlier by Chelune and colleagues (1993), where *Y*′ is simply the pretest score (*X*) plus the mean difference score or practice effect seen in an appropriate set of controls. The novelty of the RC_{WSD} approach lies in the derivation of the error term the authors label as WSD. Bland and Altman (1996) referred to this error as *S*_{W}, or the square-root of the mean squares (MS) residual for repeated measures, bearing in mind residual refers to unexplainable variance. Regularly, variances (or Sums of Squares, SS) in any ANOVA can be partitioned as follows: SS Total = SS between-people + SS within-people. In the repeated measures situation, the SS within-people can be further partitioned into SS between-treatment + SS residual (Winer, 1971). And ordinarily the *F*-ratio of interest is the MS between-treatment/MS residual, with the systematic influence of subject differences (MS between-people) removed. Doyle and Doyle (1997) make exactly this critique and go on to point out there is indeed a significant between-treatment effect in the Bland and Altman data. Note that, between-treatment variation is explained variance that is controlled by correcting for the practice effect in the numerator when deriving *Y*′ and thus should not appear in the error variance. The *S*_{W} (or WSD) is calculated by finding the variance of the repeated scores for each subject and then finding the square-root of the average of these values (Bland & Altman, 1996). This calculation process effectively combines the between-treatment variance with the residual variance, and thus a systematic variance with a random error. 
Moreover, the square-root of the MS residual, once multiplied by √2 to reflect a difference between two scores, yields the standard deviation of the difference scores (SE[1]) in a two-level test–retest scenario, and as such its use as an error term for individual change departs from the classical test theory (see Maassen, 2004). Thus, the theoretical benefit of the RC_{WSD} as proposed by Lewis and colleagues (2007, p. 255) seems unfounded.

Bland and Altman (1996) described the use of the *S*_{W} (or WSD) for evaluating differences between a subject's measured and true value, that is, measurement error. They then went on to briefly refer to “repeatability,” which is a concept perhaps more familiar in the biological, chemical, and physical sciences, where multiple measures are made of the same subject, with the same measure, by the same observer, under the same conditions, over a short period of time (Chinn, 1991). In a situation where repeated observations are affected only by chance, the WSD may be a reasonable measure of within-people variability. However, in the setting of repeated neuropsychological assessment, or even repeated respiratory effort, a learning effect is often seen. This systematic learning effect is manifest through the between-treatment variance, which when present can be removed from the within-people variance as described above (Doyle & Doyle, 1997). Bland and Altman actually refer to the *S*_{W} (or WSD) when examining variability from a true to a measured score. In the absence of a between-treatment or practice effect, the WSD would be an appropriate error term in the setting of a single assessment. Thus, the use of the WSD as an error term for RC parallels the mistake made by Jacobson and colleagues (1984), as pointed out by Christensen and Mendoza (1986). To interpret the difference between two measured scores, Bland and Altman stipulate the following error term: √2 × *S*_{W}. Thus, by extension, it appears that Bland and Altman would suggest that this is the appropriate error term for determining RC. Moreover, it can be readily demonstrated that √2 × *S*_{W} in the two-level scenario (the one usually seen in RC) will exactly equal the standard deviation of the difference scores when the MS residual does not incorporate MS between-treatments.
Again, this is the same error term proposed by Christensen and Mendoza (1986) and subsequently used by Temkin and colleagues (1999) and Rasmussen and colleagues (2001) in the RC_{ISPOCD}. The increased sensitivity reported by Lewis and colleagues (2007) is solely due to the usually smaller value of the WSD in comparison to the other error terms evaluated—given that the same numerator is used in all but the Jacobson and Truax (1991) methods. Thus, the practical benefit of greater sensitivity ascribed to the WSD is flawed on the basis of the inappropriateness of its use as an error term when examining the difference between two measured scores in the setting of practice effects.
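The relation between the WSD, the repeated measures partition, and the standard deviation of the difference scores can be written out for the two-occasion case. The symbols *d*_{i}, *d̄*, and *S*_{D} are introduced here only for illustration and do not appear in the original sources.

```latex
\begin{align*}
d_i &= Y_i - X_i, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
S_D^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(d_i - \bar{d}\bigr)^{2} \\[4pt]
\mathrm{SS}_{\text{between-treatment}} &= \frac{n\,\bar{d}^{\,2}}{2}, \qquad
\mathrm{SS}_{\text{residual}} = \frac{(n-1)\,S_D^{2}}{2} \\[4pt]
\mathrm{WSD}^{2} &= \frac{1}{n}\sum_{i=1}^{n}\frac{d_i^{2}}{2}
 = \frac{\mathrm{SS}_{\text{between-treatment}} + \mathrm{SS}_{\text{residual}}}{n} \\[4pt]
\mathrm{MS}_{\text{residual}} &= \frac{\mathrm{SS}_{\text{residual}}}{n-1} = \frac{S_D^{2}}{2}
 \quad\Longrightarrow\quad \sqrt{2\,\mathrm{MS}_{\text{residual}}} = S_D
\end{align*}
```

The third line shows that any nonzero mean practice effect enters the WSD through the between-treatment term, while the last line shows that the residual term alone, scaled by √2, returns the standard deviation of the difference scores.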

A research note published by Maassen (2010) elaborated on the inappropriateness of the WSD as an error term for determining RC. Using simulated data sets, Maassen demonstrated that in the absence of practice effects (i.e., difference between initial and retest variances or means) the WSD tends to underestimate error and thus inflate false-positive rates. Furthermore, when large practice effects were present, Maassen found that the WSD overestimated error and thus became effectively insensitive to change. This should not be surprising given that the WSD, as will be demonstrated, incorporates systematic variance, as manifest by mean practice effects, into the error estimate. The present paper seeks to compare the false-positive rate (1 − specificity) for a selection of models as reviewed by Hinton-Bayre (2010), while also incorporating the Lewis and colleagues (2007) model. In particular, the conditions under which models differ in false-positive classification will be examined.

## Method

A sample of 104 male rugby league players completed the Speed of Comprehension Test (SOCT; Baddeley, Emslie, & Nimmo-Smith, 1992), Digit Symbol Substitution Test (DSST) from the WAIS-R (Wechsler, 1981), and Symbol Digit Modalities Test (SDMT) written version (Smith, 1982), on two occasions 1–3 weeks apart. On average, the sample was 18.5 years of age (*SD* = 2.5), had 11.9 years of education (*SD* = 1.0), and an estimated full scale IQ of 88.5 (*SD* = 8.5), based on the National Adult Reading Test-Revised (Crawford, 1992). These data were collected in a larger study prospectively examining the effects of concussion in contact sport. A limited portion of these data has been previously published (Hinton-Bayre, 2004).

## RC Models

RC models can be broadly separated into mean practice and regression-based approaches. Recall that all RC models can be reduced to RC = (*Y* − *Y*′)/SE. The mean practice approaches, as the term suggests, all estimate *Y*′ by adding the practice effect (the difference between retest and initial test means: *M*_{Y} − *M*_{X}) to the initial test score (*X*). The four models considered differ only in terms of the SE implemented. The Chelune and colleagues (1993) model uses SE[2] (Table 1). Temkin and colleagues (1999) and Rasmussen and colleagues (2001) effectively used SE[1]. Iverson and colleagues (2003) and Maassen (2004) used SE[3]. Finally, Lewis and colleagues (2007) used SE[4]. The differences between these models vary as a function of the difference between test and retest variances, or differential practice (Maassen, 2005), and test reliability. The differences are predictable and will be reconsidered shortly.
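As a minimal sketch of the mean practice approach, the following computes RC for one individual under each of the four error terms. The means and SE values are the SOCT figures reported in Table 2; the individual's scores are hypothetical and chosen only to show how the choice of SE can alter classification at the ±1.645 cutoff.

```python
def predicted_retest(x, m_x, m_y):
    """Y' = X + (M_Y - M_X): initial score plus the control-group practice effect."""
    return x + (m_y - m_x)

def reliable_change(y, y_pred, se):
    """RC = (Y - Y') / SE."""
    return (y - y_pred) / se

# SOCT control-group means and error terms as reported in Table 2.
M_X, M_Y = 45.44, 53.41
ERROR_TERMS = {
    "SE[1] Temkin": 10.74,
    "SE[2] Chelune": 9.58,
    "SE[3] Iverson/Maassen": 10.42,
    "SE[4] WSD": 9.43,
}

# Hypothetical individual (not a member of the study sample).
x, y = 40, 64
y_pred = predicted_retest(x, M_X, M_Y)

rc_scores = {name: reliable_change(y, y_pred, se) for name, se in ERROR_TERMS.items()}
flagged = {name: abs(rc) > 1.645 for name, rc in rc_scores.items()}

for name, rc in rc_scores.items():
    print(f"{name}: RC = {rc:.2f}, significant = {flagged[name]}")
```

Since the numerator is identical across these models, any disagreement in classification for a given individual is driven entirely by the size of the error term.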

Two regression-based models were considered. The SRB model is attributed to McSweeny and colleagues (1993) and is calculated using a least-squares regression line to obtain *Y*′ and the standard error of estimate as the corresponding error term, SE[5]. The predicted score is *Y*′ = *bX* + *a*, where *b* is the slope of the least-squares regression line and *a* the constant. The slope *b* is usually calculated as the covariance of *X* and *Y* divided by the variance of *X*: *b* = *S*_{XY}/*S*_{X}^{2}, where *S*_{XY} denotes the covariance. The coefficient *b* can also be calculated by multiplying the Pearson product-moment correlation by the standard deviation of retest scores divided by the standard deviation of the initial test scores: *b* = *r*_{XY} × (*S*_{Y}/*S*_{X}). Maassen and colleagues (2006) suggested an adjustment to the regression model where the regression coefficients are corrected for proposed bias, where *b*_{adj} = *b*/*r*_{XY} and thus *b*_{adj} = *S*_{Y}/*S*_{X}. The *a*_{adj} constant is calculated in the traditional manner using the respective *b* coefficient, for example, *a*_{adj} = *M*_{Y} − *b*_{adj}*M*_{X}. Maassen and colleagues further recommended the use of SE[3] as the corresponding error term.
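The two regression-based predictions can be sketched from summary statistics alone. The function names are illustrative, the SOCT statistics are those reported in Table 2, and the individual's scores are hypothetical.

```python
def srb_coefficients(m_x, m_y, s_x, s_y, r_xy):
    """Least-squares slope and constant: b = r_XY * (S_Y / S_X), a = M_Y - b * M_X."""
    b = r_xy * s_y / s_x
    return b, m_y - b * m_x

def adjusted_coefficients(m_x, m_y, s_x, s_y):
    """Adjusted slope and constant: b_adj = S_Y / S_X, a_adj = M_Y - b_adj * M_X."""
    b_adj = s_y / s_x
    return b_adj, m_y - b_adj * m_x

def rc_regression(x, y, b, a, se):
    """RC = (Y - Y') / SE with Y' = b * X + a."""
    return (y - (b * x + a)) / se

# SOCT summary statistics and error terms as reported in Table 2.
M_X, M_Y, S_X, S_Y, R_XY = 45.44, 53.41, 16.66, 19.47, 0.83
SE3, SE5 = 10.42, 10.73

b, a = srb_coefficients(M_X, M_Y, S_X, S_Y, R_XY)
b_adj, a_adj = adjusted_coefficients(M_X, M_Y, S_X, S_Y)

# Hypothetical individual with a high initial score.
x, y = 70, 75
print(f"SRB:      Y' = {b * x + a:.2f}, RC = {rc_regression(x, y, b, a, SE5):.2f}")
print(f"Adjusted: Y' = {b_adj * x + a_adj:.2f}, RC = {rc_regression(x, y, b_adj, a_adj, SE3):.2f}")
```

Both lines pass through the point (*M*_{X}, *M*_{Y}), so the two predictions coincide for an individual at the control mean and diverge as the initial score becomes more extreme.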

## Results

Table 2 provides the descriptive and inferential statistics generated. Significant practice effects were observed for all three measures, chiefly due to the statistically large sample size. A significant difference between initial and retest variances, referred to as differential practice (Maassen, 2005), was observed for the SOCT and SDMT. Retest variability was greater than initial test variability for SOCT suggesting divergence from the mean. Conversely, retest variance was less than initial test variance for SDMT, suggesting regression to the mean.

| | SOCT | DSST | SDMT |
|---|---|---|---|
| Initial test mean | 45.44 | 57.75 | 53.98 |
| Retest mean | 53.41 | 63.76 | 55.88 |
| Initial test (SD) | 16.66 | 11.35 | 10.88 |
| Retest (SD) | 19.47 | 11.62 | 9.34 |
| *r*_{XY} | 0.83 | 0.70 | 0.65 |
| Mean practice (*t*) | 7.57 | 6.90 | 2.26 |
| Differential practice (*t*) | 3.89 | 0.44 | 2.61 |
| Standard error | | | |
| SE[1] CM | 10.74 | 8.88 | 8.55 |
| SE[2] JT | 9.58 | 8.78 | 9.07 |
| SE[3] Iverson/Maassen | 10.42 | 8.88 | 8.46 |
| SE[4] WSD | 9.43 | 7.56 | 6.16 |
| SE[5] SRB | 10.73 | 8.29 | 7.08 |
| RC classification: + positive change/− negative change (correlation between *Z*_{X} and RC scores in parentheses) | | | |
| Temkin | 7/4 (−0.04) | 6/6 (−0.36) | 8/8 (−0.56) |
| Chelune | 9/6 (−0.04) | 6/6 (−0.36) | 5/7 (−0.56) |
| Iverson | 7/4 (−0.04) | 6/6 (−0.36) | 8/8 (−0.56) |
| WSD | 9/6 (−0.04) | 6/8 (−0.36) | 12/10 (−0.56) |
| SRB | 7/3 (−0.01) | 5/4 (−0.01) | 7/5 (−0.01) |
| Maassen | 5/5 (−0.29) | 6/6 (−0.39) | 6/7 (−0.42) |

*Notes:* RC = reliable change; SOCT = Speed of Comprehension Test; DSST = Digit Symbol Substitution Test; SDMT = Symbol Digit Modalities Test; CM = Christensen and Mendoza; JT = Jacobson and Truax; WSD = within-subjects standard deviation; SRB = standardized regression-based; *Z*_{X} = standardized initial test score.

The present paper sought to compare the error terms and ultimately classification rates for the six RC models described earlier. The Jacobson and Truax (1991) RC model does not account for practice effects and thus will not be further considered here, as its estimates of change will be biased. Table 2 also presents SE estimates for the error terms noted in Table 1. It can be readily appreciated that the SE_{WSD} produces the smallest error value on each test, consistent with the observation by Lewis and colleagues (2007). The SE[1] and SE[3] were equal when *S*_{X} was equivalent to *S*_{Y}, as was the case for DSST. SE[3] was always less than or equal to SE[1], as has been previously demonstrated (Maassen, 2004). In the setting of significant differential practice (e.g., SOCT and SDMT), SE[2], which assumes equality of variances, provides a lesser or greater estimate of error, respectively, compared with SE[1] and SE[3]. The linear regression error term, SE[5], was relatively large in the setting of divergence from the mean (*S*_{X} < *S*_{Y}) seen for SOCT. However, when test–retest variances were equal (DSST) or there was evidence of regression to the mean (*S*_{X} > *S*_{Y}; SDMT), SE[5] was relatively smaller compared with other error terms, excluding SE_{WSD}.

Table 2 goes on to present the classification rates of the six RC models. Theoretically, given the sample size of 104, one would expect roughly five individuals to produce an RC score greater than +1.645 and another five individuals an RC score less than −1.645, assuming a two-tailed confidence level of 90%. In the case of the DSST, where reliability was good and variances were equal, the classification rates approximated the theoretical distribution across RC models. Where there was divergence from the mean on the SOCT, most models provided a classification rate commensurate with chance levels. Notably, the RC model of Chelune and colleagues (1993) and the RC_{WSD} yielded the highest false-positive classifications. Where there was regression to the mean on retesting on the SDMT, accompanied by less impressive reliability, false-positive rates were most varied across models. Nonetheless, the RC_{WSD} consistently provided the greatest false-positive classification rate among the RC models examined across each of the tests. A series of goodness-of-fit *χ*^{2} analyses (*p* < .05) were conducted to evaluate whether the classifications of change differed from the theoretical distribution, that is, whether more than the approximately 10 expected individuals were classified as significantly improved or deteriorated. The 21% false-positive rate for the RC_{WSD} on SDMT was the only one to reach statistical significance. The alleged sensitivity of the WSD as an error term (SE[4]) as described by Lewis and colleagues (2007) came at the expense of consistently reduced specificity in real data.
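The expected chance classifications and the goodness-of-fit comparison can be illustrated with the counts in Table 2. The exact specification of the *χ*^{2} analyses is not given in the text, so the two-category form used here (changed versus unchanged, against a 10% chance rate, with one degree of freedom) is an assumption made for illustration.

```python
from statistics import NormalDist

N = 104                  # sample size
CUTOFF = 1.645           # 90% two-tailed criterion
CRITICAL_CHI2 = 3.841    # chi-square critical value, df = 1, p = .05

tail_probability = 1 - NormalDist().cdf(CUTOFF)
expected_per_tail = N * tail_probability   # roughly five individuals per tail

def chi_square_gof(changed, n=N, chance_rate=0.10):
    """Two-category goodness-of-fit statistic (assumed form of the test)."""
    expected_changed = n * chance_rate
    expected_unchanged = n * (1 - chance_rate)
    unchanged = n - changed
    return ((changed - expected_changed) ** 2 / expected_changed
            + (unchanged - expected_unchanged) ** 2 / expected_unchanged)

# Total classifications (positive + negative) taken from Table 2.
counts = {
    "Temkin":  {"SOCT": 7 + 4, "DSST": 6 + 6, "SDMT": 8 + 8},
    "Chelune": {"SOCT": 9 + 6, "DSST": 6 + 6, "SDMT": 5 + 7},
    "Iverson": {"SOCT": 7 + 4, "DSST": 6 + 6, "SDMT": 8 + 8},
    "WSD":     {"SOCT": 9 + 6, "DSST": 6 + 8, "SDMT": 12 + 10},
    "SRB":     {"SOCT": 7 + 3, "DSST": 5 + 4, "SDMT": 7 + 5},
    "Maassen": {"SOCT": 5 + 5, "DSST": 6 + 6, "SDMT": 6 + 7},
}

significant = [
    (model, test)
    for model, tests in counts.items()
    for test, changed in tests.items()
    if chi_square_gof(changed) > CRITICAL_CHI2
]
print(f"Expected per tail: {expected_per_tail:.1f}")
print("Exceeds chance:", significant)
```

Under this assumed form of the test, the WSD on the SDMT (22 of 104 cases) is the only cell that exceeds the critical value.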

Further examination of Table 2 classifications suggests that there are differences in the symmetry of classification for RC models. Under a standard normal distribution of RC scores, one would expect approximately 5% of cases in either tail to be significantly changed due to chance. When variances were equal (e.g., DSST) there was good symmetry of classification. With divergence from the mean on retesting (e.g., SOCT), there was a trend toward more positive RC changes. Furthermore, when there was evidence of regression to the mean on retesting (e.g., SDMT), the asymmetry was more obvious for models where only one or the other variance was used to estimate error, namely SE[2] and SE[5], using initial and retest variances, respectively. The most consistent symmetry was seen for the Maassen RC model.

The RC models based on the mean practice effect do not correct for extremeness of the initial test score, whereas regression-based models do. Individual data have not been presented; however, inspection of the individual RC scores across models revealed that the standardized deviation or extremeness of the initial score, *Z*_{X} = (*X* − *M*_{X})/*S*_{X}, did not appear to predispose an individual to later significant change on retesting for any RC model on any of the three measures. However, correlations between initial standardized scores and subsequent RC scores did vary across models for different test conditions. Correlations between initial test *Z*_{X} scores and subsequent RC scores, presented in Table 2, approximated zero when there was evidence of differential practice such that *S*_{X} < *S*_{Y}, suggesting divergence from the mean (e.g., SOCT), for mean practice effect models (*r* = −.04) and the SRB (*r* = −.01). Interestingly, the Maassen model suggested a weak negative relationship between initial *Z*_{X} scores and RC scores (*r* = −.29) for SOCT. When initial and retest variances were equal (e.g., DSST), the correlation was also weakly negative for mean practice effect models (*r* = −.36) and the Maassen model (*r* = −.39), but essentially zero (*r* = −.01) for the SRB. Finally, when there was evidence of regression to the mean on retesting (e.g., SDMT), initial standard scores were moderately negatively correlated with RC scores for mean practice effect models (*r* = −.56), but less so with the Maassen model (*r* = −.42), and not correlated under the SRB model (*r* = −.01). Thus, the SRB model appears to be the only one in which the RC outcome is not influenced by how extreme scores are at the outset. It should be noted that these statements of relationship perhaps only hold in the presence of significant practice effects.
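The pattern of correlations between *Z*_{X} and RC scores follows from the structure of each model and can be approximated from summary statistics. The closed forms below are derived here for illustration and are not taken from the cited sources; they rely on RC being a linear function of *Y* − *Y*′, so that dividing by any constant SE leaves the correlation unchanged.

```python
import math

def corr_zx_mean_practice(s_x, s_y, r_xy):
    """corr(X, Y - X): shared by all mean practice models, whatever SE is used."""
    s_d = math.sqrt(s_x**2 + s_y**2 - 2 * r_xy * s_x * s_y)
    return (r_xy * s_y - s_x) / s_d

def corr_zx_adjusted_regression(r_xy):
    """corr(X, Y - (S_Y / S_X) * X), which reduces to -sqrt((1 - r_XY) / 2)."""
    return -math.sqrt((1 - r_xy) / 2)

def corr_zx_srb():
    """Least-squares residuals are uncorrelated with the predictor."""
    return 0.0

# Summary statistics (S_X, S_Y, r_XY) as reported in Table 2.
TESTS = {
    "SOCT": (16.66, 19.47, 0.83),
    "DSST": (11.35, 11.62, 0.70),
    "SDMT": (10.88, 9.34, 0.65),
}

for test, (s_x, s_y, r_xy) in TESTS.items():
    print(test,
          round(corr_zx_mean_practice(s_x, s_y, r_xy), 2),
          round(corr_zx_adjusted_regression(r_xy), 2),
          corr_zx_srb())
```

This also accounts for the identical correlations reported for the Temkin, Chelune, Iverson, and WSD models on each test: they share a numerator and differ only by a constant divisor. Agreement with the tabled values is approximate because the published reliability coefficients are rounded.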

Table 3 presents the inferential statistics for each of the three measures. As mentioned, the SS between-people is not of interest. Now, it can be demonstrated that the sum of SS between-treatment and SS residual divided by the sum of *df* between-treatment and *df* residual gives the within-subjects variance, the square root of which yields the WSD, such that SOCT WSD = √88.85 = 9.43, DSST WSD = √57.15 = 7.56, and SDMT WSD = √37.99 = 6.16. This clearly demonstrates that the WSD incorporates systematic and error variance (cf. SE[4] values in Table 2).

| Test | Source | SS | df | MS | F-value |
|---|---|---|---|---|---|
| SOCT | Between people | 61674.42 | 103 | 598.78 | 10.39 |
| | Between treatment | 3304.04 | 1 | 3304.04 | 57.33 |
| | Residual | 5936.46 | 103 | 57.64 | |
| | Between-treatment + residual | 9240.50 | 104 | 88.85 | |
| DSST | Between people | 23101 | 103 | 224.28 | 5.68 |
| | Between treatment | 1878.0 | 1 | 1878.0 | 47.58 |
| | Residual | 4065.50 | 103 | 39.47 | |
| | Between-treatment + residual | 5943.50 | 104 | 57.15 | |
| SDMT | Between people | 17417.42 | 103 | 169.10 | 4.63 |
| | Between treatment | 186.58 | 1 | 186.58 | 5.11 |
| | Residual | 3763.92 | 103 | 36.54 | |
| | Between-treatment + residual | 3950.50 | 104 | 37.99 | |

*Notes:* SOCT = Speed of Comprehension Test; DSST = Digit Symbol Substitution Test; SDMT = Symbol Digit Modalities Test.
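The partition in Table 3 can be reproduced numerically. The sketch below uses a small hypothetical set of (initial, retest) pairs, since individual data are not presented, and then repeats the arithmetic for the SOCT using the tabled sums of squares. The function names are illustrative.

```python
import math
from statistics import mean, stdev, variance

def partition_two_occasions(pairs):
    """Return (SS between-treatment, SS residual) for a two-occasion design."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    grand = mean(xs + ys)
    ss_total = sum((v - grand) ** 2 for v in xs + ys)
    ss_people = 2 * sum(((x + y) / 2 - grand) ** 2 for x, y in pairs)
    ss_treatment = n * ((mean(xs) - grand) ** 2 + (mean(ys) - grand) ** 2)
    ss_residual = ss_total - ss_people - ss_treatment
    return ss_treatment, ss_residual

def wsd(pairs):
    """Square root of the mean within-subject variance (Bland & Altman, 1996)."""
    return math.sqrt(mean([variance([x, y]) for x, y in pairs]))

# Hypothetical data with a clear practice effect (not from the study sample).
pairs = [(10, 14), (12, 13), (15, 20), (9, 9), (11, 16)]
n = len(pairs)
ss_treatment, ss_residual = partition_two_occasions(pairs)
sd_difference = stdev([y - x for x, y in pairs])

print(wsd(pairs) ** 2, (ss_treatment + ss_residual) / n)        # the two agree
print(math.sqrt(2 * ss_residual / (n - 1)), sd_difference)      # the two agree

# The same arithmetic with the SOCT sums of squares from Table 3.
soct_wsd = math.sqrt((3304.04 + 5936.46) / 104)
soct_sd_difference = math.sqrt(2 * 57.64)
print(round(soct_wsd, 2), round(soct_sd_difference, 2))
```

The last line recovers the SOCT values for SE[4] and SE[1] in Table 2 from the Table 3 entries, showing that the WSD retains the between-treatment sum of squares whereas the standard deviation of the difference scores depends on the residual alone.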

## Discussion

The present paper had two aims. The first was to compare the false-positive rate (hence specificity) in a collection of well-known and recent RC models using real data. This was achieved in a select sample of relatively homogeneous individuals, that is, young male athletes, who were retested a median of 2 weeks apart. The three measures of focus in this study were subject to practice effects and showed moderate to good reliability. The tests differed most clearly on differential practice, otherwise termed inequality of variance. Recent studies have focused either on limited case example comparison of models (Hinton-Bayre, 2010) or on theoretical data sets (Maassen, 2010). Thus, this paper may be considered an extension of these earlier works. The second aim was to highlight the RC_{WSD} of Lewis and colleagues (2007), the most recent addition to the RC family. In particular, this paper sought to demonstrate that the improved sensitivity to change alleged by its authors is based on faulty assumptions regarding what constitutes statistical error. In this way, the current paper operationalizes a recent theoretical critique of the RC_{WSD} (Maassen, 2010). It should also be noted that errors occurring in Table 1 from Lewis and colleagues have been amended and the data re-presented in an Appendix.

A series of important findings were elucidated by this study. As suggested in earlier work (Hinton-Bayre, 2010), when initial and retest variances are equal, RC models will tend to agree regarding classifications of significant RC. Conversely, departures from this ideal situation were accompanied by increasing variation in classification amongst the RC models examined. These data also reinforce the point that where there is potential for mean practice effects on retesting, as is commonly seen in performance-based measurement, there is also potential for differential practice. In this journal, Hinton-Bayre (2010) has already suggested that differential practice appears to play a significant role in producing more varied estimates of RC across different models. In the present data, the WSD was consistently the smallest SE, and the RC_{WSD} consistently had the highest false-positive rate of all models (13%–21%). Given the limited sample size for detecting significant deviations from a chance false-positive rate of 10%, the RC_{WSD} on the SDMT was the only occasion on which a statistically significant increase in false-positive classification was seen. Lewis and colleagues (2007) had claimed that the SE_{WSD} provided an unbiased estimate of error. This paper clearly demonstrates that in the setting of practice effects, the SE_{WSD} actually combines systematic and residual errors, as seen in Table 3. Moreover, when the systematic component is removed and the error calculated as alluded to by Bland and Altman (1996), this yields an error term already existent in the literature (RC_{ISPOCD} and RC_{Temkin}). This latter error term has been criticized for departing from classical test theory (Maassen, 2004; Maassen et al., 2006). In the current data set, the most consistent classification rate with the most symmetrical false-positive classification rate was that of the RC_{Maassen}.
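The decomposition described above can be checked against the sums of squares reported in Table 3. The following is a minimal sketch, not the original analysis. It assumes two test occasions, takes the WSD to be the square root of the pooled (between-treatment + residual) mean square, and takes the SE of a difference score built from the residual alone to be the square root of twice the residual mean square. The function name `error_terms` and the dictionary keys are illustrative only.

```python
import math

# Sums of squares as reported in Table 3 (df: treatment = 1, residual = 103).
TABLE_3 = {
    "SOCT": {"ss_treatment": 3304.04, "ss_residual": 5936.46},
    "DSST": {"ss_treatment": 1878.00, "ss_residual": 4065.50},
    "SDMT": {"ss_treatment": 186.58, "ss_residual": 3763.92},
}


def error_terms(ss_treatment, ss_residual, df_treatment=1, df_residual=103):
    """Return illustrative error terms from a two-occasion repeated-measures ANOVA.

    Assumption: the WSD pools the systematic (between-treatment) and residual
    sums of squares, whereas the residual-only term excludes the practice effect.
    """
    ms_within = (ss_treatment + ss_residual) / (df_treatment + df_residual)
    ms_residual = ss_residual / df_residual
    return {
        # Within-subjects SD: systematic and residual variance combined.
        "wsd": math.sqrt(ms_within),
        # SD of a single score once the systematic effect is removed.
        "residual_sd": math.sqrt(ms_residual),
        # SE of a difference between two scores, residual variance only.
        "se_difference": math.sqrt(2 * ms_residual),
        # Proportion of the within-subjects sum of squares that is systematic.
        "systematic_share": ss_treatment / (ss_treatment + ss_residual),
    }


if __name__ == "__main__":
    for test_name, ss in TABLE_3.items():
        terms = error_terms(**ss)
        print(test_name, {key: round(value, 2) for key, value in terms.items()})
```

Under these assumptions the pooled term is smaller than the residual-based SE of the difference for all three tests, which is consistent with the WSD being the smallest SE in the present data.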

The findings of this paper have significant implications for the implementation of RC statistics in retest analysis of individual data measured on a continuum. If ideal psychometric properties are present, then the model used is likely of little consequence. A previous work has demonstrated that when reliability is good, variances are equivalent, and the control/normative sample is adequately matched, the RC model employed is of lesser concern (Hinton-Bayre, 2010). However, departures from these ideals produce discordance between models. In the absence of an agreed approach, one could argue that RC analysis may not be appropriate unless certain criteria regarding these parameters are met. It has been suggested that reliability be >0.70, there should not be differential practice, and the case should fall within 1 *SD* of the control sample at initial testing (Hinton-Bayre, 2010). Empirical investigation of these suggestions is wanting. When test and retest data are available, the use of SE[2] is questionable, given the extra information provided by the possession of retest variance. The question remains over how to proceed when test and retest variances are not equivalent. Should the investigator shy away from RC analysis altogether, or should they incorporate a model that considers such a discrepancy (e.g., the RC_{SRB})? Further consideration of the scope of application of RC models is required. It is clear, however, that the RC_{WSD} is not theoretically or practically suited to the task of RC. As argued recently by Maassen (2010), the WSD is a measure of variation for a single score, not a change score. Thus, the increased sensitivity proposed by Lewis and colleagues (2007) is illusory, and may even paradoxically be reversed in the setting of large practice effects (Maassen, 2010). In the current real test–retest data set of performance measures demonstrating practice effects, the RC model of Maassen and colleagues (2006) provided the most consistent and symmetrical false-positive rate approximating chance levels. However, the SRB was the only RC model not to demonstrate a relationship between initial retest performance and subsequent RC values. It should be noted that no obvious link was seen between actual extreme scores at initial testing and significant RC scores on retest.
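The suggested criteria for proceeding with an RC analysis (reliability above 0.70, no differential practice, and an initial score within 1 *SD* of the control mean) can be expressed as a simple screening step. The sketch below is illustrative only. The class `ControlSummary`, the function `rc_precondition_failures`, and the example values are hypothetical. The decision about whether variances differ is left to the analyst's own test and is passed in as a flag.

```python
from dataclasses import dataclass


@dataclass
class ControlSummary:
    """Hypothetical container for control-sample retest statistics."""
    reliability: float       # test-retest correlation in the control sample
    mean_t1: float           # control mean at initial testing
    sd_t1: float             # control SD at initial testing
    variances_differ: bool   # result of the analyst's own test of unequal variances


def rc_precondition_failures(control, case_score_t1,
                             min_reliability=0.70, max_abs_z=1.0):
    """List which of the suggested criteria (Hinton-Bayre, 2010) are not met.

    An empty list means all three suggested criteria are satisfied.
    """
    failures = []
    if control.reliability <= min_reliability:
        failures.append("reliability is not above %.2f" % min_reliability)
    if control.variances_differ:
        failures.append("differential practice (unequal variances)")
    # Standardized distance of the case from the control mean at initial testing.
    z_initial = (case_score_t1 - control.mean_t1) / control.sd_t1
    if abs(z_initial) > max_abs_z:
        failures.append("initial score is more than %.1f SD from the control mean" % max_abs_z)
    return failures


if __name__ == "__main__":
    # Invented values, for illustration only.
    control = ControlSummary(reliability=0.85, mean_t1=50.0, sd_t1=10.0,
                             variances_differ=False)
    print(rc_precondition_failures(control, case_score_t1=55.0))
    print(rc_precondition_failures(control, case_score_t1=68.0))
```

Whether these thresholds are too lenient or too strict is an open empirical question, so they are exposed as parameters rather than fixed.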

Naturally, the limitations of the present study bring to bear significant caveats regarding the results described. The sample, although well powered to examine for practice and differential practice effects, was not well powered to examine discrepancies in classification at the 90% confidence level. The pattern of results is also limited to measures with significant practice effects. This is seen as a plus given it is regularly the norm in performance measurement. We do not see the select nature of the sample as a limitation given the theoretical focus of this paper, and also the relative homogeneity would have facilitated reduction in individual difference error. It is noted, however, that this paper comments only on specificity of RC models. A clinical or experimental group is required to compare sensitivity of RC models. Nonetheless, a control sample enables the examination of the individual change distributions for an RC model. With a confidence interval of 90%, one should expect to see a false-positive rate of 10% and this provides a rational and testable evaluation of RC model accuracy. Variations in clinical or experimental effect will naturally provide an added dimension to consistency of classification, with larger effects or more variable effects producing differing effects on RC models.
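The expectation of a 10% false-positive rate under a 90% confidence interval can be evaluated with an exact binomial tail probability. The sketch below is one way to do this and is not necessarily the test used in the present analysis. It assumes a control sample of 104, inferred from the 103 between-people degrees of freedom in Table 3, and a one-sided test for an excess of false positives. The counts in the usage example are hypothetical and chosen only to fall near the 13% and 21% rates mentioned earlier.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def excess_false_positive_p(n_flagged, n_controls, chance_rate=0.10):
    """One-sided p value for observing at least n_flagged false positives.

    chance_rate is the false-positive rate expected from the confidence
    interval in use (0.10 for a 90% interval).
    """
    return binomial_upper_tail(n_flagged, n_controls, chance_rate)


if __name__ == "__main__":
    n_controls = 104  # assumption: between-people df (103) + 1
    for n_flagged in (14, 22):  # hypothetical counts, roughly 13% and 21%
        rate = n_flagged / n_controls
        p_value = excess_false_positive_p(n_flagged, n_controls)
        print(f"{n_flagged}/{n_controls} flagged ({rate:.1%}): one-sided p = {p_value:.4f}")
```

With a sample of this size, only fairly large departures from 10% reach conventional significance, which illustrates the power limitation noted above.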

It is not presently clear whether one should choose an RC model based on a sensitivity/specificity tradeoff or based on theoretical appropriateness to the psychometric conditions at hand. As noted previously, multiple test parameters will affect RC values (Hinton-Bayre, 2010). Systematic examination of variations in such parameters (i.e., practice effects, differential practice, reliability, individual comparability to controls) would help to further elucidate discrepancies and consistencies across models. In this way either a universally appropriate RC model or a decision process for selecting one under certain conditions can be considered. Such investigation may yield circumstances where it is questionable whether an RC approach is appropriate at all!

This paper has demonstrated that the RC models compared tend to agree in the presence of practice effects when there is limited evidence of differential practice. To this end, it is sensible that a minimum reliability be required when making decisions about individual change. Whether 0.7 is too lenient or too strict requires further consideration. It is also again clear that differential practice has a profound differential influence on RC models. It is not clear whether RC models are appropriate at all in this setting, or whether a model that accounts for such discrepancy is preferable, for example, a regression-based model. In the current data set, the Maassen RC model provided the most consistent and symmetric classification. It was also clear that the WSD as described by Lewis and colleagues (2007) has neither a theoretical nor a practical benefit over its competitor error terms in RC and is in fact inappropriate for the task. According to its original authors, the WSD was never intended to be used for comparison of actual scores for an individual (*see* also Chinn, 1991). Bland and Altman (1996) were also not specifically describing a situation where a learning effect was expected, although one was demonstrated in their data (Doyle & Doyle, 1997). Removal of the between-level variance from the WSD yields a true residual, reflecting unexplained variance. Use of this latter residual would yield exactly the error term used in RC analyses by Temkin and colleagues (1999) and Rasmussen and colleagues (2001). Moreover, this error term has been strongly criticized in the literature for its apparent deviation from classical test theory. For a more contemporary discussion of the various RC models, the interested reader is directed to Maassen and colleagues (2009), Maassen (2010), and Hinton-Bayre (2010). A consensus on an RC model or approach to selecting a model for determining individual change would be ideal. 
To that end, continued comparison of RC models in theoretical, normative, and clinical/experimental data sets is required. The author is able to provide a spreadsheet to facilitate the calculation of different RC model scores on request.

### Appendix

Error and group change estimates used in the RCI equations

RCI | Formula | WLT | TMTA | TMTB | COWAT | DSST | GPD | GPND |
---|---|---|---|---|---|---|---|---|
Δ*X*_{C} | | 0.8 (3.8) | −0.1 (15.6) | 8.5 (24.7) | 4.5 (10.5) | −1.5 (8.3) | −7.6 (14.5) | −0.1 (15.6) |
RCI_{J&T} | | 3.54 | 12.85 | 26.49 | 11.02 | 8.90 | 12.64 | 12.10 |
RCI_{ISPOCD} | | 3.80 | 15.59 | 24.70 | 10.47 | 8.28 | 14.48 | 15.58 |
RCI_{Chelune} | | 3.54 | 12.85 | 26.49 | 11.02 | 8.90 | 12.64 | 12.10 |
RCI_{WSD} | | 2.73 | 10.96 | 18.37 | 8.02 | 5.92 | 11.53 | 10.96 |

*Notes:* Δ*X* = individual time 2 performance − time 1 performance; Δ*X*_{C} = mean control group time 2 performance − time 1 performance; , where *r*_{xx} = test–retest reliability of the measure; = standard deviation of the control group change; = within-subject standard deviation of the control group. Tasks: WLT = CERAD word-learning task; TMTA = Trail Making Task part A; TMTB = Trail Making Task part B; COWAT = Controlled Oral Word Association Task; DSST = Digit Symbol Substitution Task; GPD = Grooved Pegboard Dominant; GPND = Grooved Pegboard Nondominant.