## Abstract

In this Journal, Lewis and colleagues introduced a new Reliable Change Index (RCI_{WSD}), which incorporated the within-subject standard deviation (WSD) of a repeated measurement design as the standard error. In this note, two opposite errors in using WSD this way are demonstrated. First, because WSD is the standard error of measurement of only a single assessment, it is too small when practice effects are absent, so that too many individuals are designated reliably changed. Second, WSD can grow without limit to the extent that differential practice effects occur. This can even make RCI_{WSD} unable to detect any reliable change.

Lewis, Maruff, Silbert, Evered, and Scott (2007) introduced a new Reliable Change Index (RCI_{WSD}), using the *within-subject standard deviation* (WSD) of a repeated measurement design as the standard error (hereafter abbreviated as s.e.). According to these authors, the WSD provides a substantial advantage: As an s.e., it turns out to be notably smaller than the standard errors currently in use. As a consequence, the RCI_{WSD} appeared to be much more sensitive in designating individuals as reliably changed. In this note I wish to clarify two major misunderstandings in using WSD. As a start, it is useful to review the RCIs proposed in the literature that are relevant for this discussion. Some of them are currently in use and have also been applied by Lewis and colleagues (2007).

Jacobson, Follette, and Revenstorf (1984) were the first to use the term RCI, proposing a statistic composed as a ratio of an individual's observed change (Δ*X*) in the numerator and an s.e. *S*_{ED} in the denominator, which they expressed as follows:

$$S_{ED} = S_X \sqrt{1 - r_{XY}} \qquad (1)$$

where *X* and *Y*, respectively, are the pre- and posttest scores of a certain outcome measure observed in a control group, and *r*_{XY} is the test–retest reliability. In a letter to the editor of the journal where this RCI was published, Christensen and Mendoza (1986) noted that s.e.(1) was not correct, because it incorporates only the unreliability of one assessment, whereas the assessment of change should involve the unreliability of two measurements. Nevertheless, expression (1) is cited here because it is relevant for the present note, as will be shown later. Complying with the recommendations by Christensen and Mendoza, Jacobson and Truax (1991) proposed the following adjusted standard error:

$$S_{ED} = \sqrt{2 S_{EX}^2} = S_X \sqrt{2 (1 - r_{XY})} \qquad (2)$$

where *S*_{EX} is the s.e. of measurement of the pretest. The reader will see that the correction by Christensen and Mendoza amounts to multiplying the s.e. by a factor √2. The RCI with this s.e. is currently widely used and is cited as RCI_{J&T} by Lewis and colleagues (2007). In fact, it is an adaptation of the full formula of the s.e. *of measurement of the difference score* as a function of the variance and reliability coefficient of pre- and posttest, already given by McNemar (1962) in terms of population parameters:

$$\sigma_{ED} = \sqrt{\sigma_X^2 (1 - \rho_{XX}) + \sigma_Y^2 (1 - \rho_{YY})} \qquad (3)$$

S.e.(2) amounts to estimating the population parameters by their values observed in a control sample, assuming that (i) initial and final variance are equal; (ii) the reliability coefficients of pre- and posttest are equal; and (iii) their common value can be estimated by the test–retest reliability. Maassen (2004) has argued that, although the population values of pretest and posttest variance may be equal, pooling the two control sample values provides a better estimate, making it less vulnerable to a deviant pretest variance value in the control sample. Therefore, he proposed

$$S_{ED} = \sqrt{(S_X^2 + S_Y^2)(1 - r_{XY})} \qquad (4)$$

Jacobson and colleagues applied their RCIs within the context of psychotherapy research, where it is reasonable to assume that practice effects do not occur, which is tantamount to assuming equal pre- and posttest variance in the control group. In neuropsychology, however, tests of cognitive function are frequently used and the occurrence of practice effects should be anticipated. Researchers should apply methods of assessing reliable change that account for these effects. Lewis and colleagues (2007) did apply three such RCIs, composed of Δ*X* − Δ*X*_{c} in the numerator and three different standard errors in the denominator. Here, Δ*X*_{c} is the mean change observed in the control sample, interpreted as the mean practice effect, implying that each individual's observed change score is corrected for practice effect by subtracting the mean practice effect in the control sample. As a denominator Lewis and colleagues (2007) used:

- (a) s.e.(2), leading to an RCI quoted as RCI_{CHELUNE};
- (b) the standard deviation of the difference scores in the control sample, s.e.(5), leading to an RCI quoted as RCI_{ISPOCD};
- (c) WSD as denominator, leading to RCI_{WSD}.

The standard deviation of the difference scores in (b) is

$$S_D = \sqrt{S_X^2 + S_Y^2 - 2 S_X S_Y r_{XY}} \qquad (5)$$

S.e.(5) is not consistent with the original definition of assessing a reliable change, since in fact it measures whether an individual change in the experimental group should be considered extreme when compared with the changes established in the control group, regardless of the source of error (random or non-random) (Crawford, Howell, & Garthwaite, 1998, p. 902; Maassen, 2004). The standard errors (4) and (5) are equal if all the practice effects are zero or equal to a constant, because adding a constant to the posttest score then does not change the posttest variance and the correlation between pre- and posttest score. However, when differential practice effects occur, RCI_{ISPOCD} bears the contradiction of incorporating a numerator that is decreased because of the subtraction of an estimate of the practice effect, and a denominator that is increased with the variance of the practice effects (Maassen, 2004). In that case, RCI_{ISPOCD} is always more conservative than the RCI incorporating s.e.(4).

Lewis and colleagues (2007) referred to Bland and Altman (1996), who showed how the within-subject standard deviation can be viewed as an s.e. *of measurement*. To demonstrate the calculation and the applicability of the WSD, I return to Bland and Altman's note and the data matrix that is central to their text (Bland and Altman, 1996, Table 1). For the present discussion, only the structure of this matrix is relevant, not the content. This matrix shows the scores of 20 persons who were assessed four times, organized in 20 rows corresponding to the 20 individuals, and four columns corresponding to the four consecutive measurements. Bland and Altman argued that the variance of the four scores in each row, calculated as a sample variance (i.e., *df* = 3), can be regarded as an estimate of the error variance (i.e., the s.e. *of measurement* squared) of the test under consideration for the corresponding individual. The s.e. of measurement of a certain outcome measure is usually assumed to be equal across individuals. Thus, the average value of the 20 estimates is regarded as the eventual estimate of this common s.e. squared. This value happens to be the Residual Mean Square (within-subject variance) in the one-way ANOVA with 20 levels (individuals) and four independent replications per level.

Bland and Altman's argument is correct, provided that the four scores in each row stem from the same distribution, i.e., from a distribution with the same mean, which, according to Classical Test Theory (CTT), is defined as the true score of the corresponding individual. The argument is clearly not correct when the consecutive scores in a row are affected by effects of testing, which will probably be the case regarding the second, third, and fourth assessment in each row. Consequently, the mean of each row will deviate from the estimate of the true score of a single measurement. Whatever the WSD estimates, it is larger than the estimate of the s.e. of measurement of a single assessment. Unfortunately, Bland and Altman did not emphasize the crucial assumption of zero practice effects, which gave rise to the first misunderstanding in Lewis and colleagues (2007). As noted above, they implemented WSD (= *S*_{w}) in an RCI that aimed to account for practice effects.
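The calculation of the within-subject standard deviation attributed to Bland and Altman (1996) can be sketched in a few lines. This is a minimal illustration, assuming that each person contributes one row of repeated scores and that WSD is the square root of the mean of the per-person sample variances; the function name `within_subject_sd` and the small data matrix are hypothetical and are not taken from Bland and Altman's Table 1.

```python
import math
import statistics


def within_subject_sd(scores):
    """Return the WSD for a list of rows (one row of repeated scores per person)."""
    # Sample variance of each person's own scores (df = number of scores - 1).
    per_person_variances = [statistics.variance(row) for row in scores]
    # Average the variances across persons first, then take the square root.
    return math.sqrt(statistics.mean(per_person_variances))


# Hypothetical scores: 3 persons (rows) assessed on 4 occasions (columns).
scores = [
    [10.0, 12.0, 11.0, 11.0],
    [20.0, 18.0, 19.0, 19.0],
    [15.0, 15.0, 16.0, 14.0],
]

print(round(within_subject_sd(scores), 4))
```

With equally many scores per person, the averaged variance is the within-subject mean square of a one-way ANOVA, so the value is an estimate of the s.e. of measurement only if the scores within a row share the same mean.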

Table 1. Statistics calculated in the control group that are required for the RCI formulas

Situation no. | π | M_{X} | M_{Y} | SD_{X} | SD_{Y} | r_{XY} | s.e. Formula (2) | s.e. Formula (4) | s.e. Formula (5) | WSD
---|---|---|---|---|---|---|---|---|---|---
1 | 0 | 99.43 | 98.75 | 11.25 | 10.91 | .7901 | 7.29 | 7.18 | 7.19 | 4.78
2 | N(1,0.2) | 100.78 | 101.09 | 11.40 | 6.69 | .5737 | 10.53 | 8.63 | 9.34 | 6.26
3 | N(5,1) | 100.92 | 104.97 | 11.11 | 7.30 | .6489 | 9.31 | 7.88 | 8.45 | 6.31
4 | N(25,5) | 100.14 | 125.13 | 12.05 | 9.01 | .5807 | 11.03 | 9.74 | 10.01 | 18.78

*Note*: π = practice effect; the column below π shows the normal distribution of the practice effects.

The second misunderstanding in Lewis and colleagues (2007) originates in ignoring the paragraph in Bland and Altman's text on the difference between two measurements for the same subject. Although Bland and Altman were probably not acquainted with the concept of reliable change, they showed that they were aware that the assessment of change should reckon with the unreliability of two measurements. However, as can be derived from the foregoing discussion, by using WSD as s.e., Lewis and colleagues (2007) incorporated only the s.e. of measurement (the unreliability) of one measurement, while the numerator of their RCI contains the random error variability of two measurements. Thus, they made the same mistake that Jacobson and colleagues (1984) initially made in proposing s.e.(1), a mistake that was corrected in the literature about 25 years ago.

Readers who are not familiar with test theory may be puzzled by the theoretical discussion above. They will probably be more convinced by a demonstration of the consequences in practice. From the tables of Lewis and colleagues (2007), it can be ascertained that, for all the outcome variables involved, the WSD is notably smaller than the other standard errors, leading to their claim that RCI_{WSD} is always more sensitive. Unfortunately, it is impossible to indicate the consequences in the results presented by Lewis and colleagues in terms of percentages of fallible designations, because it is unknown whether the participants really did deteriorate or not. However, it is possible to indicate the consequences of the errors globally and qualitatively, which I will do first.

Although, in principle, WSD can be calculated on the basis of more than two time points, Lewis and colleagues (2007, Table 1) presented the WSD outcomes based on only two measurements for each member of the control group. In that case, WSD can be compared with the other standard errors quoted above. For instance, algebraically it can be easily shown (will not be done here) that the following relationship holds between the squares of s.e.(4) and WSD:

When the effects of testing do not occur, the mean difference observed in the control sample is expected to be zero, while the initial and final variance are expected to be equal. In that case, the square of s.e.(4), as well as the squares of s.e.(2) and s.e.(5), is expected to be about twice as large as WSD^{2}. The WSD is then only about 1/√2 ≈ 0.7 times s.e.(2), (4), and (5).
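The factor of about 1/√2, and the way practice effects enlarge WSD, can be made explicit for the two-occasion case. The following is a sketch under stated assumptions and need not coincide term by term with expression (6): it assumes that WSD² is the mean of the per-person sample variances (*df* = 1), that *S*_{D} is the standard deviation of the *n* control-group difference scores *d*_{i} = *y*_{i} − *x*_{i} (*df* = *n* − 1), and that *M*_{D} = *M*_{Y} − *M*_{X}. The symbols *d*_{i}, *S*_{D}, and *M*_{D} are introduced here for illustration only.

$$
\begin{aligned}
\mathrm{WSD}^2 &= \frac{1}{n}\sum_{i=1}^{n}\frac{(y_i - x_i)^2}{2} = \frac{1}{2n}\sum_{i=1}^{n} d_i^2
= \frac{1}{2}\left[\frac{n-1}{n}\,S_D^2 + M_D^2\right], \\
S_D^2 &= S_X^2 + S_Y^2 - 2\,r_{XY} S_X S_Y = (S_X^2 + S_Y^2)(1 - r_{XY}) + r_{XY}\,(S_X - S_Y)^2 .
\end{aligned}
$$

Under these assumptions, when *M*_{D} = 0 and *S*_{X} = *S*_{Y} the bracket reduces to approximately (*S*_{X}² + *S*_{Y}²)(1 − *r*_{XY}), so that WSD² is about half of that pooled quantity. A nonzero mean change or unequal pre- and posttest variances add the remaining terms and enlarge WSD.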

Imagine now the situation where, *ceteris paribus* (same control group, same outcome variable), all the members of the control group benefit from the same practice effect. [Considering the numerator of the RCIs utilized by Lewis and colleagues (2007), they assumed that this was the case in their samples.] On the one hand, except for RCI_{WSD}, this will leave the RCIs unchanged. Note that the numerator of all the RCIs remains unchanged because adding the same constant to the final score does not change the difference between an individual change and the average change in the control group. [We have already noted earlier that then the s.e.(2), (4) and (5) also remain unchanged.] On the other hand, this lowers the RCI_{WSD} outcomes because expression (6) tells us that WSD will grow to the extent that practice effects occur (i.e., to the extent that the difference between initial and final variance and the difference between initial and final mean observed in the control group differ from 0). In principle, when the practice effects are sufficiently large, WSD can even grow to a value that makes RCI_{WSD} unable to detect the false positives in the control group, and what is worse, reliable changes in the experimental group.
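The effect of a constant practice effect can be checked on a toy example. The sketch below uses hypothetical pre- and posttest scores for five control subjects and assumes the two-occasion form of WSD (per-person variance (*y* − *x*)²/2, averaged over persons); the function names are illustrative and not part of any published procedure.

```python
import math
import statistics


def corrected_numerators(pre, post):
    """Individual change minus the mean change in the (control) group."""
    diffs = [y - x for x, y in zip(pre, post)]
    mean_diff = statistics.mean(diffs)
    return [d - mean_diff for d in diffs]


def sd_of_differences(pre, post):
    """Standard deviation of the difference scores (df = n - 1)."""
    return statistics.stdev([y - x for x, y in zip(pre, post)])


def wsd_two_occasions(pre, post):
    """WSD for two occasions: the per-person variance (df = 1) is (y - x)^2 / 2."""
    return math.sqrt(statistics.mean([(y - x) ** 2 / 2.0 for x, y in zip(pre, post)]))


# Hypothetical control-group scores without any practice effect.
pre = [98.0, 105.0, 91.0, 110.0, 100.0]
post = [101.0, 103.0, 95.0, 108.0, 97.0]
# The same posttest scores after every subject gains a constant 10 points.
shifted = [y + 10.0 for y in post]

for label, final in (("no practice", post), ("constant +10", shifted)):
    print(
        label,
        [round(v, 2) for v in corrected_numerators(pre, final)],
        round(sd_of_differences(pre, final), 3),
        round(wsd_two_occasions(pre, final), 3),
    )
```

In this toy case the corrected numerators and the standard deviation of the differences are identical in both conditions, whereas the WSD increases once the constant is added.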

### Numerical Example

We will now illustrate to what extent the general principles discussed above are recognized when analyzing more realistic data. To this end, we created four data files that represent the following situations: (1) negative true change (deterioration) and zero practice effects; (2) negative true change and small practice effects; (3) negative true change and moderate practice effects; (4) negative true change and very large practice effects. The reader will notice that situations (2) or (3) probably most closely resemble the actual analyses of Lewis and colleagues (2007). Because large control groups are of great importance (Maassen, Bossema, and Brand, 2009), in our experiment we constructed control groups of size *N* = 250, rather than the *N* = 90 used by Lewis and colleagues. Each of the four data files also comprises an experimental group of size *N* = 50.

For every "individual," an initial score *X* (pretest) and a final score *Y* (posttest) are created. Each initial score is the sum of a normally distributed true score (μ = 100, σ = 10) and a normally distributed measurement error (μ = 0, σ = 5). The final score *Y* is the sum of the same true score and a normally distributed measurement error with the same mean and standard deviation. For every individual in situations (2), (3), and (4), a practice effect is added to the true score component. The practice effect is normally distributed (μ = 1, σ = 0.2 in situation 2; μ = 5, σ = 1 in situation 3; μ = 25, σ = 5 in situation 4) and, via a linear regression formula, positively linked (β = 0.5) to the true initial score. This reflects the assumption that individuals with a higher initial score benefit more from a practice effect, which is often the case. Finally, for every individual in the *experimental* group, a true change component is added, which is zero for about half of the individuals, while for the other half a value drawn from the negative half of the normal distribution with μ = 0 and σ = 10 is added. This leads to negative changes of comparable size (i.e., about the same mean and standard deviation) in the experimental groups of the four situations. (For a formal description of the construction of the data see Maassen et al., 2009.)
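A rough sketch of how control-group data of this kind might be generated is given below. It is not the construction used for the tables (that is described in Maassen et al., 2009) and should not be expected to reproduce the tabulated values. In particular, the way the practice effect is linked to the true score is an assumption: here it is built to have the stated mean and standard deviation and a correlation of β = 0.5 with the true score. The function names and seeds are hypothetical, and the experimental group is omitted.

```python
import math
import random
import statistics


def simulate_control_group(n, practice_mean, practice_sd, beta=0.5, seed=0):
    """Return (pre, post) score lists for n simulated control subjects."""
    rng = random.Random(seed)
    pre, post = [], []
    for _ in range(n):
        true_score = rng.gauss(100.0, 10.0)
        z_true = (true_score - 100.0) / 10.0
        # ASSUMPTION: practice effect with the given mean and SD that
        # correlates beta with the true score (the source gives no formula).
        noise = rng.gauss(0.0, 1.0)
        practice = practice_mean + practice_sd * (
            beta * z_true + math.sqrt(1.0 - beta ** 2) * noise
        )
        pre.append(true_score + rng.gauss(0.0, 5.0))
        post.append(true_score + practice + rng.gauss(0.0, 5.0))
    return pre, post


def sd_of_differences(pre, post):
    """Standard deviation of the posttest-minus-pretest differences."""
    return statistics.stdev([y - x for x, y in zip(pre, post)])


def wsd_two_occasions(pre, post):
    """WSD for two occasions: the per-person variance (df = 1) is (y - x)^2 / 2."""
    return math.sqrt(statistics.mean([(y - x) ** 2 / 2.0 for x, y in zip(pre, post)]))


# Mean and SD of the practice effect in the four situations described above.
SITUATIONS = {1: (0.0, 0.0), 2: (1.0, 0.2), 3: (5.0, 1.0), 4: (25.0, 5.0)}

results = {}
for number, (mu, sigma) in SITUATIONS.items():
    pre, post = simulate_control_group(250, mu, sigma, seed=number)
    results[number] = (sd_of_differences(pre, post), wsd_two_occasions(pre, post))
    print(number, round(results[number][0], 2), round(results[number][1], 2))
```

Under these assumptions, the WSD in the situation without practice effects should come out near 0.7 times the standard deviation of the differences, and in the situation with very large practice effects it should exceed that standard deviation.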

## Results

Table 1 shows the outcomes of the statistics calculated in the control group that are required for the RCI formulas. Line 1 shows only small differences between s.e.(2), (4), and (5) caused by measurement errors. When practice effects occur (lines 2, 3, and 4), s.e.(2) is always the largest of these three standard errors because the posttest variance then turns out smaller than the pretest variance [see also expression (2)], as a consequence of the positive correlation between pretest score and practice effect. As expected, s.e.(5) is never smaller than s.e.(4). In the situations with zero or small practice effects, WSD is not far from being 0.7 times s.e.(4). In line 4, large practice effects occur, making WSD by far the largest standard error.
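The three conventional standard errors in Table 1 can be recomputed from the tabulated standard deviations and correlations. The sketch below assumes the usual forms of these quantities: *S*_{X}√(2(1 − *r*_{XY})) for s.e.(2), √((*S*_{X}² + *S*_{Y}²)(1 − *r*_{XY})) for s.e.(4), and √(*S*_{X}² + *S*_{Y}² − 2*S*_{X}*S*_{Y}*r*_{XY}) for s.e.(5). The function names are illustrative. The WSD column is not recomputed, because it cannot be derived from these summary statistics without further assumptions.

```python
import math


def se_formula_2(sd_x, r_xy):
    """s.e.(2): pretest SD only, with the sqrt(2) correction."""
    return sd_x * math.sqrt(2.0 * (1.0 - r_xy))


def se_formula_4(sd_x, sd_y, r_xy):
    """s.e.(4): pooled pre- and posttest variance."""
    return math.sqrt((sd_x ** 2 + sd_y ** 2) * (1.0 - r_xy))


def se_formula_5(sd_x, sd_y, r_xy):
    """s.e.(5): standard deviation of the difference scores."""
    return math.sqrt(sd_x ** 2 + sd_y ** 2 - 2.0 * sd_x * sd_y * r_xy)


# (situation, SD_X, SD_Y, r_XY, reported s.e.(2), s.e.(4), s.e.(5)), from Table 1.
TABLE_1 = [
    (1, 11.25, 10.91, 0.7901, 7.29, 7.18, 7.19),
    (2, 11.40, 6.69, 0.5737, 10.53, 8.63, 9.34),
    (3, 11.11, 7.30, 0.6489, 9.31, 7.88, 8.45),
    (4, 12.05, 9.01, 0.5807, 11.03, 9.74, 10.01),
]


def recompute(row):
    """Recompute the three standard errors for one row of TABLE_1."""
    _, sd_x, sd_y, r_xy, *_ = row
    return (
        se_formula_2(sd_x, r_xy),
        se_formula_4(sd_x, sd_y, r_xy),
        se_formula_5(sd_x, sd_y, r_xy),
    )


for row in TABLE_1:
    print(row[0], [round(v, 2) for v in recompute(row)], "reported:", row[4:])
```

Small discrepancies in the second decimal can arise because the inputs are themselves rounded.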

Table 2 shows the percentages of type I errors (false positives) in the control group. When using a 90% confidence interval, about 10% false positives are anticipated. In the cases where zero to moderate practice effects occur, RCI_{WSD} delivers far too many reliable change designations, which is of course a consequence of the relatively small WSD values. In line 4, no false positive is delivered, as a consequence of the large practice effects and the large value of WSD.
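The size of this inflation can be approximated analytically. The sketch below assumes normally distributed difference scores, no practice effects, and a standard error that is exactly 1/√2 times the appropriate one; the function name and the `se_ratio` parameter are illustrative.

```python
import math
from statistics import NormalDist


def two_sided_rate(se_ratio, confidence=0.90):
    """Expected share of unchanged subjects flagged as reliably changed.

    se_ratio is the standard error actually used divided by the correct one.
    """
    standard_normal = NormalDist()
    cutoff = standard_normal.inv_cdf(0.5 + confidence / 2.0)
    # A too-small s.e. shrinks the effective cutoff on the correct scale.
    return 2.0 * (1.0 - standard_normal.cdf(cutoff * se_ratio))


nominal_rate = two_sided_rate(1.0)
inflated_rate = two_sided_rate(1.0 / math.sqrt(2.0))

print(round(nominal_rate, 3), round(inflated_rate, 3))
```

Under these assumptions the expected rate rises from 10% to roughly 24% in total, that is, roughly 12% in each direction, which is of the same order as the WSD entries for situation 1 in Table 2.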

Table 2. Type I errors (false positives) in the control group

Situation no. | π | s.e.(2) det | s.e.(2) imp | s.e.(4) det | s.e.(4) imp | s.e.(5) det | s.e.(5) imp | WSD det | WSD imp
---|---|---|---|---|---|---|---|---|---
1 | 0 | 4.0 | 4.4 | 4.0 | 4.4 | 4.0 | 4.4 | 13.2 | 13.6
2 | N(1,0.2) | 3.6 | 2.0 | 6.8 | 5.6 | 5.2 | 3.6 | 14.4 | 12.0
3 | N(5,1) | 2.4 | 3.6 | 5.2 | 6.0 | 4.4 | 4.8 | 12.0 | 11.2
4 | N(25,5) | 4.4 | 3.6 | 6.0 | 5.6 | 6.0 | 5.6 | 0 | 0

*Note:* π = practice effect; the column below π shows the normal distribution of the practice effects. Numbers are percentages.

Abbreviations: *det* = number of subjects classified as reliably deteriorated; *imp* = number of subjects classified as reliably improved.

In fact, the other three standard errors and corresponding RCIs are only appropriate in situation (1) (no practice effects). In that situation, the three RCI formulas yield exactly the same numbers, being somewhat smaller than the expected 10% designations. In the situations (2), (3), and (4), the three RCIs deliver percentages of reliable changes that more or less differ from 10. The disadvantage of RCI_{CHELUNE} as being dependent on the difference between initial and final variance is particularly revealed in the lines 2 and 3, where the posttest variance is much smaller than the pretest variance and the numbers of reliable change designations are too low.

Table 3 shows the numbers of reliable change designations in the experimental sample, as well as (in parentheses) the numbers of persons who really deteriorated. The results presented in the table mirror the results presented in the previous table: (a) relatively high numbers of reliable change designations by RCI_{WSD} in the situations where zero or small practice effects occur and zero designations in the situation with very large practice effects; (b) equal numbers of designations by the other methods when no practice effects occur; (c) smaller numbers of reliable change designations yielded by RCI_{CHELUNE}.

Table 3. Reliable change designations in the experimental sample

Situation no. | Δ | π | s.e.(2) det | s.e.(2) imp | s.e.(4) det | s.e.(4) imp | s.e.(5) det | s.e.(5) imp | WSD det | WSD imp
---|---|---|---|---|---|---|---|---|---|---
1 | <0 | 0 | 11 (10) | 1 (0) | 11 (10) | 1 (0) | 11 (10) | 1 (0) | 18 (16) | 4 (2)
2 | <0 | N(1,0.2) | 17 (11) | 0 | 22 (15) | 0 | 20 (14) | 0 | 26 (16) | 0
3 | <0 | N(5,1) | 9 (8) | 1 (0) | 10 (8) | 2 (0) | 9 (8) | 2 (0) | 11 (8) | 3 (1)
4 | <0 | N(25,5) | 5 (5) | 3 (1) | 8 (8) | 3 (1) | 7 (7) | 3 (1) | 0 | 0

*Note:* In parentheses, the number of subjects who experienced true deterioration. Δ = average true change. π = practice effect; the column below π shows the normal distribution of the practice effects.

Abbreviations: *det* = number of subjects classified as reliably deteriorated; *imp* = number of subjects classified as reliably improved.

Unfortunately, the table also reveals the shortcomings of the reliable change methodology, being a procedure that aims to unravel several effects from limited information. We know that about half of the “participants” experienced real deterioration. None of the variants is able to detect all these cases. On the other hand, in particular in situation (2), a considerable number of cases have been designated reliably deteriorated while they actually did not change. Moreover, in most of the situations, several cases were designated reliably improved while they actually did not change or even did deteriorate. These findings should be taken as a warning to researchers not to trust unconditionally the outcomes of reliable change assessments by means of the methods here applied.

## Discussion

This commentary responds to an article by Lewis and colleagues (2007) that reports the numbers of patients whose cognitive function did deteriorate after surgery. They have chosen to use two current RCIs: (1) RCI_{J&T}, which expresses the observed change corrected for practice effects relative to the s.e. of measurement (the original concept of reliable change); (2) RCI_{ISPOCD}, which expresses the observed change corrected for practice effects relative to the standard deviation of the changes observed in the control group. In this commentary, we also introduced (3), an RCI advocated by Maassen (2004), which also expresses the observed change corrected for practice effects relative to the measurement error, but which is statistically more appropriate than RCI_{J&T}.

Lewis and colleagues sought “to calculate the RCI by expressing the change in terms of just the random error associated with the assessment”, and they state “Such an estimate of random variance is readily available using the WSD”. What they sought is indeed consistent with the original definition of reliable change assessment. They proposed to use a new RCI, quoted as RCI_{WSD}, accounting for practice effects in the same way as the other three RCIs, but incorporating the within-person standard deviation as standard error.

Our theoretical exposé and the results of our experiment converge in showing that WSD is inappropriate as an s.e. of an RCI. In the case of two assessments, the use of WSD entails two mistakes, which affect RCI_{WSD} in opposite directions. First, if no practice effects are present, RCI_{J&T}, RCI_{CHELUNE}, and the RCI advocated by Maassen account for the unreliability of two measurements, which is correct when measurement of change is concerned, while WSD is a correct estimate of the s.e. of measurement of only one measurement. When no or small practice effects occur, WSD is much too small as an s.e. of an RCI, which results in far too many reliable change designations. The occurrence of small practice effects probably characterizes the data of Lewis and colleagues (2007), leading them to claim mistakenly that their RCI is substantially more sensitive than the current statistics.

Second, the occurrence of practice effects (equal or differential) enlarges WSD. The growth of WSD is commensurate with the size of the practice effects, and WSD may become much larger than the currently accepted standard errors. In principle, this second error may compensate for the first error, leading to results similar to those of the current methods, or even overshadow the first error to such an extent that RCI_{WSD} is unable to detect any reliable change.

The RCIs used in this commentary are the same as or similar to the RCI statistics used by Lewis and colleagues (2007). The numerator of these RCIs derives from the assumption that all the participants benefit from the same practice effect, which is not very realistic. If differential practice effects are anticipated, a prominent procedure is the widely used regression-based approach of McSweeny, Naugle, Chelune, and Lüders (1993). A comprehensive discussion that also presents other options can be found in Maassen et al. (2009).

## Conflict of Interest

None declared.

## References
