Abstract

Objectives. To evaluate and compare the validity and reliability of individual and composite recall pain intensity measures.

Design. Secondary analyses using data from a published 14-day open-label crossover clinical trial comparing two active treatments.

Setting. Multiple settings.

Participants. Fifty-two adults with a history of chronic cancer pain.

Measures. Recall ratings of least, worst, and average pain during the past 2 days; composite score representing recalled characteristic pain in the past 2 days; and daily diary ratings of pain intensity from which “actual” least, worst, and average pain scores were derived.

Results. Recall ratings of least and average pain, and a composite score representing recalled characteristic pain, were accurate (differed by no more than three points from “actual” scores on a 0–100 scale). Although the recall rating of worst pain significantly (P < 0.05) overestimated actual worst pain, the differences were minor (i.e., seven to eight points on a 0–100 scale). All of the recall measures demonstrated validity via their strong associations with the measures of actual pain intensity. The recall measures also demonstrated excellent test–retest stability, although the diary-derived measures tended to be more stable than the recall measures. The composite measure of recalled characteristic pain demonstrated a high level of internal consistency (Cronbach's α = 0.90).

Conclusions. Individual recall ratings and a composite score representing recalled characteristic pain intensity are reliable and valid measures of actual pain in patients with cancer. The findings support their use as outcome measures in clinical trials.

Introduction

Recall ratings of pain intensity (e.g., “How would you rate the average intensity of your pain in the past week?”) are commonly used in pain clinical trials [1–3]. However, research over the past decade has called into question the validity of such ratings [4,5]. In particular, there is evidence that patients tend to overestimate the pain they experienced in the past. The amount of overestimation found is usually negligible, though, tending to range from 1% to 8% of the possible range of pain scales [6–9], well within the error variance of these measures [10]. Even in the rare cases in which the difference between recalled and actual pain has been shown to be greater than 8% of the possible range of the pain scale [5,11,12], the differences reported are less than the minimal clinical difference established for the visual analog scale (VAS) (i.e., less than 2.0 on a 0–10 scale and less than 20 on a 0–100 scale) [13].

Research also indicates that pain recall ratings show strong associations with actual average ratings taken from diaries. The correlation coefficients between recalled and actual average pain intensity have been shown to range from 0.68 to 0.99, with an average coefficient of 0.79 across 14 studies [5–8,11,12,14–21]. Similar coefficients are found for the correlations between recall ratings of least pain and worst pain and measures of actual least and worst pain ratings derived from diaries, with both also averaging 0.79 across three studies [12,16,21].

Questions regarding the extent to which recall ratings reflect actual pain levels are questions about validity. The evidence reviewed indicates that recall ratings contain a substantial amount of valid variance as measures of actual pain. A second important psychometric criterion for measures is reliability, that is, the extent to which they are free from measurement error [22]. The issue of reliability of pain measurement is particularly important to clinical pain trialists because pain measures with more reliability should be more responsive to pain treatments than measures with less reliability. Thus, even if recall ratings have adequate validity—as the available evidence indicates they do—if they are inadequately reliable, there is a risk that their use in a clinical trial may result in inadequate assay sensitivity, that is, their use may result in an effective treatment being deemed ineffective. Also, the use of unreliable measures requires a larger number of subjects to detect treatment effects than would be required to detect a treatment response if a more reliable measure were used.

One common method used for increasing reliability in outcome assessment is to obtain multiple measures of the domain of interest and then average these measures into a single composite score [23]. Assuming that each individual score contains at least some valid variance and that the measurement error contained in the score is due to random factors (e.g., idiosyncrasies about a particular rating scale or how each person responded to it), the measurement error gets “averaged out” while the valid variance remains, resulting in a systematic decrease in measurement error as each additional item or score is averaged into the composite score. Thus, one should see an increase in reliability as the number of items that contribute to a composite score increases.

Jensen and McFarland illustrated this expected pattern by demonstrating how two measures of reliability (test–retest stability and internal consistency) improve as the number of individual pain ratings (of current pain) that are combined into a single composite score of usual pain increases [24]. They found, for example, that a single rating of current pain correlated 0.63 with a single rating of current pain assessed 1 week later. On the other hand, a composite score made up of an average of 28 ratings of current pain correlated 0.95 with another composite score made up of 28 ratings obtained from the next week [24]. Interestingly, however, although reliability did systematically increase as expected with the number of individual scores that contributed to the composite, it took only two ratings of current pain to create a pain intensity score with a very high (0.79) test–retest stability. This suggests the possibility that relatively few pain ratings may be required to create pain intensity measures with high reliability and assay sensitivity. Supporting this hypothesis, Jensen and McFarland also found that two ratings of current pain evidenced a very high level of internal consistency (Cronbach's α of 0.84, reflecting the extent to which the items of a measure assess the same construct).
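The pattern described above can be illustrated with the Spearman–Brown prophecy formula, which is not cited in the source and is used here only as a sketch. It assumes that the individual ratings are parallel measurements with independent random error, and it treats the single-rating stability of 0.63 as an estimate of the reliability of one rating; both are assumptions of this example, and the function name is hypothetical.

```python
def spearman_brown(single_reliability: float, n_ratings: int) -> float:
    """Predicted reliability of the mean of n_ratings parallel ratings."""
    if n_ratings < 1:
        raise ValueError("n_ratings must be at least 1")
    r = single_reliability
    return n_ratings * r / (1 + (n_ratings - 1) * r)


# 0.63 is the single-rating stability quoted in the text; treating it as the
# reliability of one rating is an assumption of this sketch.
for k in (1, 2, 28):
    print(k, round(spearman_brown(0.63, k), 2))
```

Under these assumptions the formula predicts roughly 0.77 for two ratings and 0.98 for 28 ratings, in the same range as the reported values of 0.79 and 0.95, with most of the gain arriving in the first few ratings.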

The findings published to date regarding the reliability and validity of composite and recall pain intensity measures can potentially be interpreted in ways that lead to different conclusions regarding the most useful methods to assess pain intensity. For example, one could focus on the finding that people tend to overestimate pain when providing single recall ratings (e.g., of pain in the past week) and therefore decide that single recall ratings should not be used in pain clinical trials. This would support a decision to use methods that require participants in clinical trials to provide multiple ratings of pain and then use these ratings (instead of recall ratings) to create scores representing accurate measures of least (e.g., the lowest pain level reported), worst (e.g., the highest pain level reported), and average (e.g., an average of all pain ratings) pain experienced by the participant. However, such assessment procedures can be expensive, given the hardware and software requirements for data collection, as well as the costs associated with hardware and software management. These procedures also result in significant assessment burden for participants and can therefore result in more missing data than would occur with simpler (and perhaps as valid and reliable) assessment procedures.

On the other hand, one could focus on the evidence supporting the reliability and validity of recall ratings, arguing that even if they overestimate pain to a negligible degree at times, their use in clinical trials does not necessarily result in overestimations of changes in pain with treatment. For example, someone who reports an average pain intensity rating of “80” out of 100 before treatment and “30” after treatment may have actually experienced pain intensities of 75 and 25; yet the change in pain computed by using the recall ratings (80 − 30 = 50) would be the same as that computed had actual pain scores been available (75 − 25 = 50); slight inaccuracies in recall ratings do not necessarily translate to inaccuracy in identifying changes in pain. Moreover, if demonstrated to be reliable, given the fact that recall ratings require participants in trials to only report on their pain at the beginning and a single follow-up point in time, their use could dramatically reduce assessment burden. This could result in less missing data—which would increase the generalizability of findings—and also significantly reduce costs in clinical trials.
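The arithmetic in this example can be stated more generally. The sketch below assumes, purely for illustration, that recall adds the same constant bias b to the actual score A at both time points; the symbols R, A, and b are introduced here and do not appear in the source.

```latex
R_{\text{pre}} = A_{\text{pre}} + b, \qquad
R_{\text{post}} = A_{\text{post}} + b
\;\Longrightarrow\;
R_{\text{pre}} - R_{\text{post}} = A_{\text{pre}} - A_{\text{post}}
% Example from the text with b = 5: (75 + 5) - (25 + 5) = 80 - 30 = 50 = 75 - 25
```

If the bias differed between the two assessments, the recalled change would be off by the difference between the two biases, which is why the text says that inaccuracy in recall does not necessarily translate to inaccuracy in change.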

In addition, although much of the research and discussion regarding the validity and reliability of recall pain measures has focused on individual recall ratings (e.g., single recall ratings of least, worst, and average pain over a specified time period), psychometric theory would suggest that a composite score created by averaging individual recall ratings into a single measure would result in a score that has an even higher level of validity and reliability than individual ratings do. Preliminary support for this possibility was reported by Broderick and colleagues, who examined the psychometric properties of composite scores made up of multiple ratings of average pain in the past day (i.e., “What was the average level of your pain today?”) [7]. They found that a single end-of-day (EOD) recall rating correlated about 0.78 with a composite score representing actual average pain (over a previous 7-day period). Importantly, however, a composite score of two EOD ratings correlated about 0.87 with the actual average pain rating score, and each time an additional EOD rating was added to the composite score, the validity coefficient increased further, consistent with psychometric theory.

It might be even easier to create a composite score of recall ratings obtained at a single point in time rather than at the end of the day over multiple days (e.g., “Please rate the worst pain you have experienced in the past 24 hours,” “… least pain …” and “… average pain …”) and combine these three individual ratings into a single composite score representing “characteristic pain” [25,26]. Such a measure might yield many of the benefits of composite scores (i.e., high levels of validity and reliability) without the costs and burdens associated with diary measures. In fact, some preliminary studies support the potential validity of such measures. For example, characteristic pain scores made up of recall ratings of worst, least, and average pain show very strong associations with actual average pain scores computed from pain diaries (coefficients of 0.85–0.86) [14,19].

The current study sought to expand our knowledge regarding the validity and reliability of recall pain intensity measures, with particular interest in the psychometric properties of a characteristic pain score created from the recall ratings of worst, least, and average pain. Using data from a published 14-day open-label crossover trial that included both 1) multiple measures of current pain assessed each day and 2) 2-day recall ratings assessed on the 7th and 14th day, we compared the reliability and validity of recall pain ratings with scores of “actual” pain derived from diary ratings. Based on previous research, reviewed above, we hypothesized that the 2-day recall ratings of least, worst, and average pain would show a negligible overestimation (i.e., between 0 and 10 points on a 0–100 scale) of actual least, worst, and average pain scores derived from the responses in the daily diaries. We also hypothesized strong correlations (rs = ∼0.79) between recalled and actual least, worst, and average pain. Although some change in pain over the course of the trial for individual patients was possible, given that all subjects were switched from one analgesic to another in the study, we hypothesized that pain intensity levels during the first week of the trial would evidence strong test–retest stability (i.e., be strongly associated with pain intensity levels during the second week of the trial) and that a composite recall score of characteristic pain would show a high level of test–retest stability from 1 week to the next, comparable with a composite score of average pain created from multiple diary ratings of current pain. 
Finally, we hypothesized that the internal consistency of scores made up of multiple ratings (including a characteristic pain score made up by averaging the three recall ratings, as well as composite pain intensity scores made up of all of the available diary ratings from the final 2 days of each treatment week) would be high and similar to each other.

Materials and Methods

Study Design and Population

We obtained the data for the secondary analyses presented here from a completed and previously published clinical trial [27]. Briefly, the trial was an open-label 14-day crossover trial designed to evaluate the effectiveness and safety of an extended-release (ER) formulation of oxymorphone administered following treatment with controlled-release (CR) morphine sulfate or CR oxycodone. Patients with cancer pain were stabilized for 3 or more days on either morphine CR or oxycodone CR (the selection was based on patients' previous use or investigator preference). They were then treated for 7 days at the stabilized dose and then crossed over to 7 days of treatment with an equianalgesic dose of oxymorphone ER.

Measures

Daily Diary Measures of Worst, Least, and Average Pain Intensity

Current pain intensity was assessed four times each day during the 14-day treatment period using paper-and-pencil diaries to record scores for a VAS. The VAS consisted of a 10-cm line with the terms “No pain” and “Excruciating pain” below the endpoints. Participants were asked to make a mark on the 10-cm line that would indicate the severity of their current pain. The VAS has a great deal of evidence supporting its validity as a measure of pain intensity [28]. Participants were asked to rate their current pain four times each day starting at 8 am and ending at 8 pm. They brought their completed diaries with them to a clinic visit at the end of each treatment period (i.e., days 7 and 14 of the trial).

Data from the diaries were used to derive “actual” (see Note 1) worst, least, and average pain VAS ratings during the last 2 days of each treatment period. The 2-day actual least VAS pain scores were defined as the lowest VAS score recorded during days 6 and 7 and days 13 and 14. The 2-day actual worst VAS pain ratings were the highest VAS scores recorded during days 6 and 7 and days 13 and 14. The 2-day actual average VAS scores were the arithmetic mean of the pain intensity ratings provided on days 6 and 7 and days 13 and 14. Scores were derived for all subjects who provided at least 75% of the ratings on days 6 and 7 and days 13 and 14 (i.e., minimum of six ratings within each 2-day period prior to the recall rating).
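The derivation rules above can be sketched in a few lines of Python. This is an illustration only, not the trial's actual data-processing code: the function name, the use of None for a missing diary entry, and the example values are all hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def derive_actual_scores(
    ratings: Sequence[Optional[float]],
    min_fraction: float = 0.75,
) -> Optional[dict]:
    """Derive 2-day "actual" least, worst, and average pain from diary ratings.

    `ratings` holds one slot per scheduled diary entry (4 per day x 2 days = 8),
    with None marking a missing entry. Returns None when fewer than
    `min_fraction` of the scheduled ratings were provided.
    """
    provided = [r for r in ratings if r is not None]
    if not ratings or len(provided) < min_fraction * len(ratings):
        return None
    return {
        "least": min(provided),     # lowest rating recorded in the window
        "worst": max(provided),     # highest rating recorded in the window
        "average": mean(provided),  # arithmetic mean of the ratings provided
    }


# Hypothetical diary for days 6 and 7 (0-100 VAS) with two entries missing.
diary = [20, 35, None, 50, 25, 40, None, 30]
print(derive_actual_scores(diary))
```

With eight scheduled entries, the default threshold of 0.75 reproduces the minimum of six ratings stated in the text; a diary with only five entries would return None and be excluded.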

Recalled Pain Intensity

At the end of each 7-day treatment period, during the scheduled clinic visit, participants were asked to recall their worst, least, and average pain during the previous 2 days using the same VAS that was used to assess pain intensity on the daily diary (described above). These three ratings were averaged to create a composite measure of recalled characteristic pain, and analyses examined the validity and reliability of both the individual recall ratings and the characteristic pain score.

Statistical Analysis

Summary statistics for the demographic variables were computed to describe the sample. To test the hypothesis that 2-day recall ratings of worst, least, and average pain would overestimate actual least, worst, and average pain, we performed a series of t-tests comparing the recalled least, worst, and average 2-day VAS ratings (obtained on study days 7 and 14) with the actual least, worst, and average 2-day pain intensity scores derived from the diary VAS ratings. To test the hypothesis that the associations between recalled and actual ratings would be strong, we computed Pearson's correlation coefficients between the recall VAS measures from days 7 and 14 (the three ratings of least, worst, and average pain in the past 2 days, as well as the mean of these three) and the actual 2-day least, worst, and average scores derived from the diaries. To test the hypothesis that all of the pain intensity measures (including 1) the three individual 2-day recall VAS ratings, 2) the single composite score created by averaging the three 2-day recall VAS ratings, and 3) the actual 2-day worst, least, and average VAS ratings) would be stable over a 7-day period and show similar levels of test–retest stability, we computed Pearson's correlation coefficients between each of these measures from the first week of treatment and the same measures from the second week of treatment. Finally, to test the hypothesis that the internal consistency of scales made up of multiple ratings (including the composite score of recalled characteristic pain made up of the three recall ratings, as well as scales made up of all of the available ratings from the final 2 days of each treatment week) would be high and similar to each other, we computed Cronbach's αs for 1) composite scores made up of the three VAS recall ratings obtained on days 7 and 14 and 2) composite scores made up of 2 days' worth of current VAS pain intensity diary ratings from the 2 days before the recall ratings were obtained.
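The first three analyses can be sketched with the Python standard library. The source says only "t-tests"; a paired test is assumed here because recalled and actual scores come from the same participants. The function names and the example values are hypothetical, and the sketch stops at the t statistic rather than computing a P value.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t(recall: Sequence[float], actual: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for (recall - actual).

    Assumes the differences are not all identical (nonzero standard deviation).
    """
    if len(recall) != len(actual) or len(recall) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [r - a for r, a in zip(recall, actual)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation, used here for both validity and test-retest."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical recalled and diary-derived worst pain for five participants.
recall_worst = [60, 45, 80, 30, 55]
actual_worst = [52, 40, 70, 28, 50]
t_value, df = paired_t(recall_worst, actual_worst)
print(f"t({df}) = {t_value:.2f}, r = {pearson_r(recall_worst, actual_worst):.2f}")
```

A positive t value indicates that recall exceeds the diary-derived score on average. The same `pearson_r` function serves for test–retest stability by passing week 1 and week 2 values of a single measure.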

Results

Patient Accounting and Description

Eighty-six participants entered the trial. Thirty-four were assigned to receive morphine CR and 52 were assigned to receive oxycodone CR. Sixty-three participants completed stabilization, and 59 of these completed the entire trial. Complete data for the variables used in the current analyses were available for 52 of these participants. The mean age of the 52 participants in the current analyses was 52.9 years (standard deviation [SD] = 11.41, range of 28–77 years). The average duration of having cancer was 5.1 years (SD = 4.62, range of 0.1–18.6 years). Female patients comprised 56% of the participants. White patients comprised 88%, 10% were Black, and 2% were Hispanic. The location of pain varied, with 19 (37%) experiencing bony or soft tissue pain, 8 (15%) experiencing neuropathic pain, 4 (8%) experiencing visceral pain, 20 (38%) experiencing mixed pain, and 1 (2%) experiencing pain of unknown origin. Of the 52 patients, 15 (29%) were in the morphine CR group during the first period, 37 (71%) were in the oxycodone CR group during the first period, and all 52 (100%) were in the oxymorphone ER group during the second period.

Accuracy of the Recall Ratings

As predicted, all of the differences between the 2-day recall measures obtained on days 7 and 14 (the ratings of least, worst, and average pain, and the characteristic pain score) and the corresponding actual 2-day scores derived from the diaries were negligible (less than 10 points on the 0–100 scale; see Table 1). In fact, the differences between the recall and actual scores were not statistically significant for the ratings of least and average pain or for the characteristic pain score. Only the VAS 2-day recall ratings of worst pain assessed on day 7 and day 14 were significantly higher than the actual worst pain scores derived from the diaries, and even these differences were small (7.29 and 8.32 points on the 0–100 scale, respectively).

Table 1

Mean 2-day recall ratings and actual VAS pain intensity scores for least, worst, and average pain (VAS on 0–100 scale)

Measure                 Recall Mean (SD)   Actual Mean (SD)   t Value   P Value
Days 6 and 7
  Least pain            17.93 (20.74)      16.38 (20.67)      −0.04     0.9696
  Worst pain            56.87 (28.17)      49.98 (28.26)      2.45      0.0177
  Average pain          33.22 (23.22)      32.69 (24.44)      0.21      0.8380
  Characteristic pain   36.01 (21.04)                         −1.20     0.2340*
Days 13 and 14
  Least pain            16.38 (20.67)      19.00 (22.25)      −1.27     0.2089
  Worst pain            59.38 (30.58)      51.06 (29.37)      2.97      0.0044
  Average pain          31.49 (22.72)      33.96 (24.43)      −1.13     0.2640
  Characteristic pain   35.75 (21.70)                         −0.60     0.5530*
* Compared with actual average pain from the diaries.

Characteristic pain: average of recalled least, worst, and average pain.

SD = standard deviation; VAS = visual analog scale.

Validity of the Recall Ratings

The correlations between the six 2-day VAS recall scores and the measures of actual least, worst, average, and characteristic pain derived from the diaries are presented in Table 2. As can be seen, all of the coefficients between the VAS recall scores and actual VAS ratings from the diaries were strong (range = 0.65–0.84).

Table 2

Correlations between the 2-day VAS recall scores and actual VAS pain intensity scores for least, worst, average, and characteristic pain

Measure               Days 6 and 7   Days 13 and 14
Least pain            0.65***        0.75***
Worst pain            0.73***        0.79***
Average pain          0.70***        0.78***
Characteristic pain   0.78***        0.84***
*** P < 0.001.

Characteristic pain: average of recalled least, worst, and average pain, correlated with actual average scores derived from the diaries.

VAS = visual analog scale.

Test–Retest Stability

The test–retest stability coefficients for all of the recall and diary-derived measures are presented in Table 3. As can be seen, the coefficients for the recall scores were all strong (range = 0.57–0.75). However, inconsistent with the study hypothesis, the coefficients for the three VAS diary-derived scores (2-day least, worst, and average pain) tended to be higher (range = 0.68–0.87) than those for the recall ratings.

Table 3

Test–retest stability of all pain intensity measures examined in this study

Measure                          Test–Retest Stability Coefficient (r)
Recall scores
  Least pain                     0.75***
  Worst pain                     0.65***
  Average pain                   0.57***
  Characteristic pain            0.73***
Diary-derived measures
  2-day least pain               0.87***
  2-day worst pain               0.68***
  2-day composite average pain   0.83***
*** P < 0.001.

Characteristic pain: average of recalled least, worst, and average pain.

Internal Consistency

The internal consistency (Cronbach's αs) coefficients for the characteristic pain scores computed from the VAS recall rating composite scores (made up of the 2-day recalled least, worst, and average pain ratings from days 7 and 14) were both 0.90. Cronbach's αs for the 2-day VAS average pain scores from diary days 6 and 7 and diary days 13 and 14 were 0.91 and 0.86, respectively.
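For readers unfamiliar with the statistic, Cronbach's α can be computed from the item variances and the variance of the summed score. The sketch below uses the standard formula with sample variances; the function name and the example ratings are hypothetical and are not data from the trial.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; items[i][j] is item i's rating for participant j.

    Assumes the summed scores are not all identical (nonzero total variance).
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least 2 items")
    n = len(items[0])
    if n < 2 or any(len(item) != n for item in items):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Summed score per participant across all items.
    totals = [sum(item[j] for item in items) for j in range(n)]
    item_var_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))


# Hypothetical recalled least, worst, and average pain for five participants.
least = [10, 25, 5, 40, 15]
worst = [55, 70, 35, 90, 60]
average = [30, 45, 20, 65, 35]
print(round(cronbach_alpha([least, worst, average]), 2))
```

Applied to the three recall ratings, this corresponds to the characteristic pain composite; applied to the individual diary ratings from a 2-day window, it corresponds to the diary-based composite.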

Discussion

There are five key findings from this study. First, and consistent with previous findings, the recall ratings of least, worst, and average pain and a recall characteristic pain score were all negligibly higher than the “actual” scores of these intensity domains derived from the diaries. Second, and also consistent with previous research, all of the 2-day VAS recall measures of pain intensity demonstrated strong associations with the actual pain intensity scores derived from the diaries. Third, all of the pain intensity measures, including both recall ratings and measures derived from daily diary ratings, evidenced high levels of test–retest stability from 1 week to another. Fourth, and inconsistent with the study hypotheses, the composite scores of average pain made from individual VAS measures from the diaries tended to have stronger test–retest stability than the recall measures did. Fifth, all of the composite scores representing usual pain, including a composite score of characteristic pain made up of three VAS recall ratings, evidenced very high levels of internal consistency. These findings have important implications for the selection of pain intensity measures in clinical trials and raise important questions that need to be addressed in future research regarding the validity and reliability of pain intensity measurement.

One of the more consistent findings from pain assessment research is that pain recall ratings tend to overestimate pain, although, as discussed in the Introduction section, the amount of overestimation varies across studies and is usually negligible. The current findings are consistent with previous research in that all four recall measures (three individual ratings and one composite score) were higher than the actual pain intensity scores derived from the diaries. Also consistent with previous research, the differences for the individual least and average pain ratings, as well as for the composite measure of recalled characteristic pain, were extremely small (all less than 4% of the range of the scale) and nonsignificant. Only the recalled worst pain rating was significantly higher than the actual worst rating, and even this difference was very small (around 8% of the range of the scale). These results are in line with results from other researchers who have found very small differences between recalled and actual pain [6–9] and indicate that 2-day recall ratings are generally accurate.

Because pain is a subjective experience, the extent to which patients are able to accurately rate previous pain depends on a large number of pain-related, person-related, and environmental factors. Possible factors include the following: 1) the time period under consideration (with people probably being more accurate when the recall period is more recent [8]); 2) the peak and most recent pain experience [18]; 3) level of cognitive impairment [29]; and 4) the variability of pain during the period under question (with people being more accurate when they are experiencing pain at a consistent level [30]). However, even given these factors that can potentially affect accuracy, the current findings indicate that patients with chronic cancer pain are very accurate in recalling least and average pain over the past 2 days.

Further support for the validity of recall pain ratings in our sample came from the analyses demonstrating strong associations between recall measures and actual scores of least, worst, and average pain from the diaries. The correlation coefficients between the single 2-day recall ratings of average pain and the composite score reflecting characteristic pain and the averages of 2 days' worth of diary data were all 0.70 or greater. Consistent with what would be predicted based on psychometric theory, the strongest associations for the recall measures were found between the composite recall score assessing characteristic pain and the composite scores derived from the diaries representing average pain; these coefficients were 0.78 in the first week and 0.84 in the second week of the study. This finding raises the possibility that such a measure, which has very little assessment burden and which appears to be highly reliable and valid, may prove to be an excellent choice for pain clinical trials. Research is needed to examine this possibility further, given the many benefits of such a measure if it proves to be valid in other pain populations.

The test–retest stability findings from this study do suggest the possibility that recall ratings might not prove to be as reliable as measures derived from diaries. Across all of the pain intensity domains, the composite score derived from the diaries demonstrated a higher level of test–retest stability than the recall ratings did. For the measures of least pain (0.75 vs 0.87), average pain (0.57 vs 0.83), and characteristic pain (0.73 vs 0.83), the differences were fairly substantial. On the other hand, the internal consistency of the measure of recalled characteristic pain (both Cronbach's αs = 0.90) was very high and similar to the internal consistency of the composite average pain score derived from the diaries (Cronbach's α = 0.91 and 0.86). These findings suggest that the composite recall measure may be as reliable as the average diary score. Additional tests comparing the reliabilities of these measures in other populations are needed to help determine their generalizability.

There are a number of limitations to the study that should be taken into account when interpreting the results. First, the study involved a relatively small sample (52 individuals) with specific characteristics (patients with cancer who were mostly white and middle class). Moreover, data were collected as a part of a drug trial (and not during a time when participants were drug-free). In addition, the current findings focused on 2-day recall ratings. Different results might have emerged had the recall period been longer than 2 days. Thus, even though the results were generally consistent with previous findings regarding the validity and reliability of recall ratings, we cannot assume that the results will necessarily generalize to other samples of individuals with chronic pain or to recall periods longer than 2 days. It is therefore important to replicate the analyses performed here in other samples and using a variety of recall periods to help determine the extent to which they generalize.

Also, the diaries used in this study to obtain multiple measures of current pain were paper-and-pencil diaries that were brought into the clinic visits on a weekly basis. We know that not all patients complete paper-and-pencil diaries when they are asked to [31], and we do not know how many of the ratings of current pain were completed at the requested times. It is therefore not possible to know what effect, if any, this had on the validity and reliability coefficients obtained. Some of the problems associated with paper diaries (e.g., backfilling of pain ratings, checking of previous ratings, and lack of time-stamped information) can be addressed with the use of electronic diaries [28]. However, electronic diaries also have significant weaknesses, including lack of generalizability to populations who are unable or unwilling to use them and problems associated with imputation due to missing data [28]. In any case, because the recall ratings examined in this study were obtained during the weekly clinic visits and not via diaries, potential limitations due to diary use have no impact on the results related to the stability and internal consistency coefficients of the recall ratings and the measure of characteristic pain. Finally, although evidence suggests that reactive effects from the use of pain diaries are minor [32], it is possible that rating pain on a regular basis influences the experience of pain and subsequent ratings.

Despite the limitations of this study, our findings regarding the validity and reliability of recall pain intensity ratings are consistent with what has been reported by other investigators and provide additional data supporting the potential utility of recall ratings in pain clinical trials. If these findings were to replicate in other samples of individuals with chronic pain, they could have a significant influence on the way that pain intensity is assessed in clinical trials. First, they would suggest that diary methods (electronic or otherwise) of obtaining multiple measures of current pain as a way to measure typical or average pain in a sample may not be needed in many situations. Using recall ratings instead could significantly reduce the assessment burden on patients participating in such trials and would also significantly reduce costs, given that it is much easier to obtain single measures than multiple measures. It could also substantially improve the external validity of research findings in that 1) studies could include participants who are unable or unwilling to complete multiple measures over time, thereby increasing the generalizability of findings, and 2) less missing data would result, increasing the validity of findings because researchers would not have to choose between the two evils of a) limiting analyses to only participants who provided complete data or b) imputing missing data, which results in at best only “estimates” of the data that are otherwise missing. These potential benefits indicate that additional examinations comparing the psychometric properties of different pain assessment strategies are warranted.

Note

1. “Actual” is in quotes here because it is extremely unlikely that the least and worst pain levels experienced in a 24-hour period were occurring at one of the diary assessment points. Thus, “actual” worst and least pain intensity scores derived from diaries likely underrepresent and overrepresent the worst and least pain experienced, respectively.
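The reasoning in this note can be illustrated with a minimal sketch. The pain values and the diary schedule below are hypothetical and are not taken from the trial data; the sketch only shows that scores derived from a subset of time points cannot exceed the true worst pain or fall below the true least pain.

```python
# Hypothetical pain trajectory over one day on a 0-100 scale.
# These values are illustrative only and are NOT study data.
continuous_pain = [30, 35, 50, 72, 64, 41, 22, 18, 27, 45, 58, 39]

# Assumption: the diary captures only every fourth time point.
diary_ratings = continuous_pain[::4]


def summarize(ratings):
    """Return least, worst, and average pain for a list of ratings."""
    return {
        "least": min(ratings),
        "worst": max(ratings),
        "average": sum(ratings) / len(ratings),
    }


true_scores = summarize(continuous_pain)
diary_scores = summarize(diary_ratings)

# The diary ratings are a subset of the full trajectory, so the
# diary-derived worst can never exceed the true worst, and the
# diary-derived least can never fall below the true least.
assert diary_scores["worst"] <= true_scores["worst"]
assert diary_scores["least"] >= true_scores["least"]
```

In this sketch the diary misses both the highest and the lowest point of the day, so the diary-derived worst score is lower than the true worst and the diary-derived least score is higher than the true least.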

References

1. Goldman RH, Stason WB, Park SK, et al. Low-dose amitriptyline for treatment of persistent arm pain due to repetitive use. Pain 2010;149:117–23.
2. Litt MD, Shafer DM, Kreutzer DL. Brief cognitive-behavioral treatment for TMD pain: Long-term outcomes and moderators of treatment. Pain 2010;15:110–6.
3. Skljarevski V, Zhang S, Desaiah D, et al. Duloxetine versus placebo in patients with chronic low back pain: A 12-week, fixed-dose, randomized, double-blind trial. J Pain 2010;11:1282–90.
4. Broderick JE, Stone AA, Calvanese P, Schwartz JE, Turk DC. Recalled pain ratings: A complex and poorly defined task. J Pain 2006;7:142–9.
5. Stone AA, Broderick JE, Shiffman SS, Schwartz JE. Understanding recall of weekly pain from a momentary assessment perspective: Absolute agreement, between- and within-person consistency, and judged change in weekly pain. Pain 2004;107:61–9.
6. Bolton JE, Humphreys BK, van Hedel HJ. Validity of weekly recall ratings of average pain intensity in neck pain patients. J Manip Physiol Ther 2010;33:612–7.
7. Broderick JE, Schwartz JE, Schneider S, Stone AA. Can end-of-day reports replace momentary assessment of pain and fatigue? J Pain 2009;10:274–81.
8. Broderick JE, Schwartz JE, Vikingstad G, et al. The accuracy of pain and fatigue items across different reporting periods. Pain 2008;139:146–57.
9. Salovey P, Smith AF, Turk DC, Jobe JB, Willis GB. The accuracy of memory for pain: Not so bad most of the time. APS J 1993;2:184–91.
10. Hagg O, Fritzell P, Nordwall A. The clinical importance of changes in outcome scores after treatment for chronic low back pain. Eur Spine J 2003;12:12–20.
11. Kikuchi H, Yoshiuchi K, Miyasaka N, et al. Reliability of recalled self-report on headache intensity: Investigation using ecological momentary assessment technique. Cephalalgia 2006;26:1335–43.
12. Marty M, Rozenberg S, Legout V, et al. Influence of time, activities, and memory on the assessment of chronic low back pain intensity. Spine (Phila Pa 1976) 2009;34:1604–9.
13. Grilo RM, Treves R, Preux PM, Vergne-Salle P, Bertin P. Clinically relevant VAS pain score change in patients with acute rheumatic conditions. Joint Bone Spine 2007;74:358–61.
14. Bolton JE. Accuracy of recall of usual pain intensity in back pain patients. Pain 1999;83:533–9.
15. Jamison RN, Raymond SA, Levine JG, et al. Electronic diaries for monitoring chronic pain: 1-year validation study. Pain 2001;91:277–85.
16. Jamison RN, Raymond SA, Slawsby EA, McHugo GJ, Baird JC. Pain assessment in patients with low back pain: Comparison of weekly recall and momentary electronic data. J Pain 2006;7:192–9.
17. Jamison RN, Sbrocco T, Parris WC. The influence of physical and psychosocial factors on accuracy of memory for pain in chronic pain patients. Pain 1989;37:289–94.
18. Jensen MP, Mardekian J, Lakshminarayanan M, Boye ME. Validity of 24-h recall ratings of pain severity: Biasing effects of “Peak” and “End” pain. Pain 2008;137:422–7.
19. Jensen MP, Turner LR, Turner JA, Romano JM. The use of multiple-item scales for pain intensity measurement in chronic pain patients. Pain 1996;67:35–40.
20. Stone AA, Broderick JE, Kaell AT, DelesPaul PA, Porter LE. Does the peak-end phenomenon observed in laboratory pain studies apply to real-world pain in rheumatoid arthritics? J Pain 2000;1:212–7.
21. Stone AA, Broderick JE, Schwartz JE. Validity of average, minimum, and maximum end-of-day recall assessments of pain and fatigue. Contemp Clin Trials 2010;31:483–90.
22. American Educational Research Association, American Psychological Association, National Council on Measurement in Education, Joint Committee on Standards for Educational and Psychological Testing (U.S.). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
23. Nunnally JC, Bernstein IH. Psychometric Theory. New York: McGraw-Hill; 1994.
24. Jensen MP, McFarland CA. Increasing the reliability and validity of pain intensity measurement in chronic pain patients. Pain 1993;55:195–203.
25. Cleeland CS, Ryan KM. Pain assessment: Global use of the Brief Pain Inventory. Ann Acad Med Singapore 1994;23:129–38.
26. Von Korff M, Ormel J, Keefe FJ, Dworkin SF. Grading the severity of chronic pain. Pain 1992;50:133–49.
27. Sloan P, Slatkin N, Ahdieh H. Effectiveness and safety of oral extended-release oxymorphone for the treatment of cancer pain: A pilot study. Support Care Cancer 2005;13:57–65.
28. Jensen MP, Karoly P. Self-report scales and procedures for assessing pain in adults. In: Turk DC, Melzack R, eds. Handbook of Pain Assessment, 3rd edition. New York: Guildford Press; 2011:19–44.
29. Chibnall JT, Tait RC. Pain assessment in cognitive impaired and unimpaired older adults: A comparison of four scales. Pain 2001;92:173–86.
30. Stone AR, Schwartz JE, Broderick JE, Shiffman SS. Variability of momentary pain predicts recall of weekly pain: A consequence of the peak (or salience) memory heuristic. Pers Soc Psychol B 2005;31:1340–6.
31. Stone AA, Shiffman S, Schwartz JE, Broderick JE, Hufford MR. Patient compliance with paper and electronic diaries. Control Clin Trials 2003;24:182–99.
32. Aaron LA, Turner JA, Mancl L, Brister H, Sawchuk CN. Electronic diary assessment of pain-related variables: Is reactivity a problem? J Pain 2005;6:107–15.

Disclosure: The original trial and current analyses were funded by Endo Pharmaceuticals, Inc. Errol M. Gould and Susan L. Potts were employees of Endo Pharmaceuticals, Inc. and Wei Wang was an intern for Endo Pharmaceuticals, Inc. when the analyses for this study were performed. Mark P. Jensen has served as a consultant to Endo Pharmaceuticals, Inc. and has received consulting fees from RTI Health Solutions, Covidien, Bristol-Myers Squibb, Schwartz Biosciences, Depomed, Eli Lilly, Pfizer, Merck, and Smith & Nephew within the past 36 months.