Reliability of patient-reported outcomes in rheumatoid arthritis patients: an observational prospective study

Objective. Patient-reported outcomes (PROs) such as pain, patient global assessment (PGA) and fatigue are regularly assessed in RA patients. In the present study, we aimed to explore the reliability and smallest detectable differences (SDDs) of these PROs, and whether the time between assessments has an impact on reliability. Methods. Forty RA patients on stable treatment reported the three PROs daily over two consecutive months. We assessed the reliability of these measures by calculating intraclass correlation coefficients (ICCs) and the SDDs for 1-, 7-, 14- and 28-day test-retest intervals. Results. Overall, SDD and ICC were 25 mm and 0.67 for pain, 25 mm and 0.71 for PGA and 30 mm and 0.66 for fatigue, respectively. SDD was higher with longer time periods between assessments, ranging from 19 mm (1-day intervals) to 30 mm (28-day intervals) for pain, 19 to 33 mm for PGA, and 26 to 34 mm for fatigue; correspondingly, ICC was smaller with longer intervals, and ranged between the 1- and the 28-day interval from 0.80 to 0.50 for pain, 0.83 to 0.57 for PGA and 0.76 to 0.58 for fatigue. The baseline simplified disease activity index did not have any influence on reliability. Lower baseline PRO scores led to smaller SDDs. Conclusion. Reliability of pain, PGA and fatigue measurements is dependent on the tested time interval and the baseline levels. The relatively high SDDs, even for patients in the lowest tertiles of their PROs, indicate potential issues for assessment of the presence of remission.


Introduction
Patient-reported outcomes (PROs) are widely employed in clinical studies of RA patients as well as in clinical practice, because they are an important part of the assessment of disease activity and response to therapy [1–3]. Key PRO domains include pain, patient global assessment (PGA) of disease activity and fatigue. Typically, all are evaluated on a 100-mm visual analogue scale (VAS) [4–6].
Several concepts are used to characterize these measures for use and interpretation in clinical practice. To map changes on scales to the perception of the patient, the minimal clinically important difference is used [7]. Several studies have addressed the determination of the minimal clinically important difference in VAS measurements of PROs in various rheumatic diseases [8], as it conveys important information when evaluating response to treatment.
Fewer data exist on the so-called reliability of an instrument, which characterizes its stability or reproducibility in a test-retest setting [9,10]. Thus, reliability needs to be determined while the disease, as well as personal and environmental factors, remain stable [11]. Statistically, the intraclass correlation coefficient (ICC) and the smallest detectable difference (SDD) can be calculated as relative and absolute measures of reliability, respectively [10–12].
In RA, moderate to poor reliability was found for both pain and PGA in a test-retest setting, with relatively large SDDs [13,14]. Moreover, pain scores contribute strongly to the patient's estimation of disease activity [15,16]. Fatigue is a serious symptom of chronic musculoskeletal diseases and is associated with worse functional outcomes, but a proper evaluation of its SDD has not yet been performed [17,18]. This may become clinically highly relevant if treatment decisions are based on instruments that include these measures, such as when following a treat-to-target approach in RA [19] using disease activity indices or the provisional ACR/EULAR remission criteria as the target. All of these comprise the PGA [20], and it is therefore very relevant to understand the variability underlying this particular measure [21,22]. Here, we aimed to determine the ICC and SDD for changes in pain, PGA and fatigue, and to investigate how these properties change over increasing periods of time between two assessments, and how they are influenced if patients have higher or lower initial measurements.

Study design
Forty consecutive patients from routine clinical care classified as having RA by the ACR 1987 revised criteria [23] or the ACR/EULAR 2010 criteria [24] were randomized by a computerized allocation programme into two different groups. All patients visited our clinic at baseline, at day 28 (follow-up visit 1) and at day 56 (follow-up visit 2). Between these assessments, one group of 20 patients kept daily records of their pain, fatigue and PGA in a diary; they had been encouraged to perform the assessments at the same time every day. The other 20 patients were called daily by telephone at random time points (between 8 a.m. and 4 p.m.), and asked to assess these VASs and to report the results to the study team during the call. At the time of the three clinic visits, pain, fatigue and PGA scores, as well as all other core set variables of disease activity, were obtained from each patient. During these 8 weeks, patients remained on stable treatment with DMARDs, glucocorticoids and NSAIDs.
PGA was assessed on a 100-mm VAS, using 'no disease activity' and 'highly active disease' as anchors. The wording of the question was: how do you estimate your disease activity today? (originally in German: Wie schätzen Sie heute Ihre Krankheitsaktivität ein?). Pain was evaluated on a 100-mm VAS, responding to the question: how severe is your pain today? (originally in German: Wie stark sind Ihre Schmerzen heute?), using 'no pain' and 'unbearable pain' as anchors. Fatigue was also assessed on a 100-mm VAS, asking: how strong was your fatigue today? (originally in German: Wie stark war Ihre Müdigkeit heute?), using 'no fatigue at all' and 'worst imaginable fatigue' as anchors. The study patients were not specially trained on how to answer the three questions; thus, we provided the same information as is provided to any other patient in routine care. There, the questionnaire is handed out by a health professional, who briefly explains the use of a VAS and points out the respective anchors. The patient fills out the questionnaire while waiting for the physician. The ethics committee of the Medical University Vienna approved the study, and written consent was obtained from all patients according to the Declaration of Helsinki.

Reliability analyses using ICC and SDD of PGA, pain and fatigue
The ICC can be used to assess the reliability of two or more measurements and yields a value between 0 and 1. An ICC of 1 means that 100% of the variability in the measurements is due to differences between patients (i.e. no error, no within-patient variability: perfect reliability), while an ICC of 0 means that all variability is related to within-patient variability and error. This is based on a very generic formula dividing the true variance (between-patient variability) by the observed variance (the total variance) [25–28].
In contrast to the ICC as a relative measure of reliability, the SDD provides a cut-off value for the smallest amount of difference that is needed to reliably distinguish true change from measurement error [8,9,29]. The SDD is calculated by multiplying the S.D. of the difference between two assessments by 1.96. Subtracting the SDD from, or adding it to, the mean difference yields the limits of agreement, as described by Bland and Altman [29].
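As an illustration of this generic ratio, it can be written with two symbols that are introduced here for convenience and do not appear in the cited sources: \sigma^2_b for the variance between patients and \sigma^2_w for the within-patient variance including measurement error.

```latex
\mathrm{ICC} = \frac{\sigma^2_{b}}{\sigma^2_{b} + \sigma^2_{w}}
```

With \sigma^2_w = 0 the ratio equals 1, and with \sigma^2_b = 0 it equals 0, which matches the two limiting cases described above.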
We also calculated standardized response means for PGA, pain and fatigue to assess whether a change over time that is greater than random has occurred [30].
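The SDD, the limits of agreement and the standardized response mean for a single pair of assessments can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names and the example scores are invented, the sample S.D. (n - 1 denominator) is an assumption, and the standardized response mean is taken in its usual form as the mean change divided by the S.D. of the change.

```python
import statistics


def sdd_and_limits(first, second):
    """Return (SDD, lower limit, upper limit) for paired VAS scores in mm.

    SDD = 1.96 x S.D. of the paired differences; the limits of agreement
    are the mean difference minus and plus the SDD.
    """
    diffs = [b - a for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample S.D. (assumption)
    sdd = 1.96 * sd_diff
    return sdd, mean_diff - sdd, mean_diff + sdd


def standardized_response_mean(first, second):
    """Mean change divided by the S.D. of the change."""
    diffs = [b - a for a, b in zip(first, second)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


# Invented example scores (mm) for six patients on two assessment days.
day_a = [10, 35, 50, 22, 70, 41]
day_b = [14, 30, 58, 20, 66, 47]

sdd, lower, upper = sdd_and_limits(day_a, day_b)
srm = standardized_response_mean(day_a, day_b)
print(f"SDD = {sdd:.1f} mm, limits of agreement = [{lower:.1f}, {upper:.1f}] mm")
print(f"SRM = {srm:.2f}")
```

A standardized response mean close to zero, as in this sketch, is what one would expect when no systematic change has occurred between the two assessments.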

Assessing differences in variability between the telephone and the diary group
Baseline characteristics of the two groups were compared by parametric or non-parametric tests, as appropriate. The course of PGA, pain and fatigue levels was analysed separately for each patient. We calculated ICC and SDD of PGA, pain and fatigue for the various test-retest intervals, separately for the diary and the telephone groups. Thus, we evaluated whether or not the method of obtaining repeated measurements by the patients (diary or telephone report) was an important determinant of results and, consequently, if they could be used jointly for further analyses.

Reliability for increasing intervals between two assessments
To investigate how reliable measurements remain as measurement intervals increase, we calculated ICC for pain, PGA and fatigue separately for 1-, 7-, 14- and 28-day test-retest intervals. In other words, given the 56 days of repeated assessments, we calculated 55 ICCs for the 1-day test-retest interval (days 1–2, 2–3, etc.), 50 ICCs for the 7-day interval, 43 ICCs for the 14-day interval and 29 ICCs for the 28-day interval.
ICC in this test-retest setting was calculated with a two-way mixed design because all patients were evaluated, and the VAS assessments were performed on consecutive days and thus were not purely at random; this is based on the model ICC by Shrout and Fleiss (1979) [31]. In our case, we further assumed an absolute agreement between assessment days, meaning that the model was not adjusted for differences in mean score between days [12,31]. For the calculations of the SDDs for the three PROs, we proceeded in an analogous way.
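A two-way, absolute-agreement ICC for one pair of assessment days can be sketched as below. This is only an illustration under stated assumptions: it uses the standard single-measure formula built from the ANOVA mean squares, (MSR - MSE) / (MSR + (k - 1) MSE + k (MSC - MSE) / n), and the text does not state whether the single-measure or the average-measure form was used. The function name and the example scores are invented.

```python
def icc_absolute_agreement(scores):
    """Single-measure, two-way, absolute-agreement ICC.

    scores: one row per patient, each row holding the k assessments
    being compared (k = 2 for a single test-retest pair).
    """
    n = len(scores)       # number of patients (rows)
    k = len(scores[0])    # number of assessments (columns)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-patient mean square
    msc = ss_cols / (k - 1)                 # between-day mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Invented example: every patient scores 10 mm higher on the second day.
shifted = [[10, 20], [20, 30], [30, 40]]
print(round(icc_absolute_agreement(shifted), 3))
```

Because the between-day mean square stays in the denominator, a systematic shift in the mean score between days lowers the ICC even when patients keep their rank order exactly, which is the practical meaning of not adjusting for differences in mean score between days.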

Reliability of measurements in patients with different baseline levels of disease activity
To be able to discriminate between patients with higher or lower variability/reliability of PROs, we calculated ICC and SDD separately for certain subgroups of patients. We grouped patients by forming tertiles according to their baseline simplified disease activity index (SDAI), their baseline value of PGA, their baseline pain level and their baseline fatigue level. Further, we divided patients according to whether they had ≤10 mm PGA at baseline or >10 mm PGA. Then we calculated SDDs separately for those groups. SDDs and ICCs were again calculated for 1-, 7-, 14- and 28-day test-retest intervals. In a sensitivity analysis, we excluded the 10% of patients with the largest improvement and the 10% with the largest deterioration in the SDAI during the 2-month study period. Thus, we repeated the above-described analyses in the remaining 80% of patients with more stable disease.

Patient characteristics
Forty RA patients {85% female, 60% RF positive, median SDAI: 13.4 [interquartile range (IQR) 6.5–20.4], median disease duration 9.5 years (IQR 5.0–14.8); Table 1} participated in this study. Despite randomization, there were some numerical (though not statistically significant) differences in baseline and follow-up disease characteristics in pain, PGA and fatigue between the telephone and diary groups. Medication remained stable over the study period, and NSAID use was balanced in each group (75% of patients in each group used NSAIDs); even minor changes in disease activity, as reflected by an SDAI 50% response [32], were found only in 18% of the patients at the first follow-up visit and in 25% at the second follow-up visit. Over the 2-month period, the top 10% of patients in terms of worsening had an SDAI change of 8.3 or more, and the top 10% of patients in terms of improvement had a change of -8.9 or less; the median change in SDAI was -0.9 (IQR -2.3 to 2.9). Concerning the PROs, the standardized response means for the various test-retest intervals ranged between -0.012 and 0.029 for pain, between -0.004 and 0.053 for PGA and between -0.06 and 0.01 for fatigue, thus supporting the notion of an absence of true change. Since PGA is an integral part of the SDAI, we also calculated the SDD of the SDAI. For the first month interval the SDD was 10.73, and for the second month interval it was 12.67, which gives a mean SDD for the SDAI of 11.7 (S.D. 1.37).

Reliability and SDD and the influence of the length of assessment intervals
The overall ICCs for pain, PGA and fatigue in the 1-day/7-day test-retest interval were 0.80/0.67, 0.83/0.71 and 0.76/0.66, respectively (Table 2). Correspondingly, the overall SDDs in millimetres were 18.8/24.5, 19/25 and 25.9/30.2 (Table 3). As expected, higher reliability according to ICC coincided with smaller SDD and vice versa.
Comparing the ICC and the SDD across the various assessment intervals, there was a significant trend towards lower ICC and higher SDD with longer intervals for all three measures (Tables 2 and 3). The SDDs of pain and PGA differed by the same amount between the 1-day test-retest and the 7-day test-retest intervals. This corresponds to an increase of 6 mm from 19 mm (S.D. 4.7) to 25 mm (S.D. 4.7) for pain and from 19 mm (S.D. 4.7) to 25 mm (S.D. 6) for PGA. The same difference was then observed between the 7- and 28-day test-retest intervals [increase by a further 5 mm, to 30 mm (S.D. 5.8) for pain and 30 mm (S.D. 6.1) for PGA]. SDDs for fatigue were generally higher, starting with 25.9 mm (S.D. 5.7) for the 1-day interval, which increased by 4.3 mm at the 7-day test-retest interval. The difference in SDDs of 3.7 mm between the 7- and the 28-day test-retest intervals was again about the same as between the 1- and 7-day intervals. The differences between ICCs and between SDDs comparing the 28-day interval with the 1-day test-retest interval were 0.30 and 10.8 mm, respectively, for pain, 0.27 and 10.9 mm for PGA and 0.17 and 8.1 mm for fatigue.

Differences between telephone and diary groups
The reliability as expressed by the ICC for the 1-day test-retest intervals was very similar in both the diary and the telephone group [0.78 (S.D. 0.11)/0.80 (S.D. 0.10) for pain; 0.82 (S.D. 0.09)/0.83 (S.D. 0.10) for PGA; 0.77 (S.D. 0.11)/0.72 (S.D. 0.13) for fatigue]. Differences between the separately calculated SDDs for each group over the various test-retest intervals were only 1–3 mm for pain, 0.9–4.6 mm for PGA and 0.1–2.5 mm for fatigue (detailed data not shown). The smallest differences in SDDs were found in fatigue and in overall PROs for the 28-day test-retest interval. The method of obtaining the data (self-report or telephone call) did not appear to have a strong influence on the 1-day reliability, since differences of <5 mm (corresponding to 5% of the scale range) cannot be considered meaningful; thus, we pooled the data for the remaining analyses.

Factors associated with high or low reliability and the SDD
Reliability analyses calculated separately by baseline SDAI tertiles (range: 3–8; >8–17; >17–38) did not reveal any differences that could be used to differentiate between more and less reliable patients concerning their PRO reporting (data not shown). When SDDs were calculated separately for the baseline tertiles of each of the PROs (Table 3), the heterogeneity of differently scored VAS in patients could be reduced, and in the case of PGA, higher baseline values could be shown to be associated with higher SDDs. For pain, SDDs were also lowest in the lower tertile, although no trend was observable across the three tertiles. For fatigue, SDDs differed between the lowest and both other baseline tertiles, with lowest SDDs in patients having a fatigue score of <9 mm and highest SDDs in those with >41 mm. The trend that longer test-retest intervals led to bigger SDDs and smaller ICCs persisted when patients were divided into smaller subgroups (Tables 2 and 3). ICCs were higher for patients with higher PROs, but this does not adequately reflect reliability, because the ranges of scores were bigger in the higher tertiles. Supplementary Table S1, available at Rheumatology Online, presents the results obtained when excluding the top 10% with respect to worsening in SDAI and the top 10% with respect to improvement over 2 months. These results were similar to those of the main analyses. SDDs of patients who had a baseline PGA of ≤10 mm (n = 12; remission cut-off point according to the recent ACR/EULAR remission criteria [20]) were significantly smaller than those of other patients: 15 mm (S.D. 8.3) vs 20 mm (S.D. 5.1) (using the 1-day test-retest interval); this increased to 22.3 mm (S.D. 9.3) vs 32.2 mm (S.D. 7.2) in the 28-day interval (Table 4).

Discussion
This study provides cut-offs for true change in pain and PGA in a representative population of RA patients. Since day-to-day variations in VAS have not to date been explored in depth, we designed this study asking patients to document their pain and PGA levels daily over a period of 56 days. The reliability of pain, PGA and fatigue measurements decreased with longer time intervals, although there was no true change that could actually dilute the measurement and violate reliability assumptions (i.e. no true change between the two assessment times); thus, most variability between the measurements is explainable by within-patient variability and measurement error [28]. Among the three measures assessed here, the PGA is clearly the most relevant for RA disease activity assessment; this is especially so because of its inclusion in RA disease activity composite indices. However, although mostly not directly used for activity assessment, pain and fatigue influence the patient's estimation of disease activity [15,33], and therefore they were also included in this study. We show here that cut-offs to distinguish true change from noise for measures of PGA, pain and fatigue differ when comparing different time intervals. SDDs for pain and PGA were very similar, but higher SDDs were found for fatigue. The 1-day test-retest cut-off was 19 mm for both pain and PGA. Indeed, reliability seemed to progressively decrease from the 1- to the 7-, 14- and 28-day intervals; this was also the case for fatigue, although reliability was somewhat lower. As SDDs do not change much with longer test-retest intervals, a putative threshold value of 25 mm for both pain and PGA could potentially also be valid for even longer intervals of 2–3 months, which represent the typical outpatient visit schedules of RA patients. Fatigue especially seems to show more variability over time, resulting in lower reliability and higher SDDs. Considering that baseline fatigue scores were heterogeneous (ranging from 1 to 100 mm), based on our analysis, an overall SDD threshold of 30 mm seems to be applicable.
A study in RA patients testing a 7-day interval reported 26 mm for PGA and 22 mm for pain as SDD [14]. Test-retest reliability was examined in other studies, providing reliability coefficients ranging between 0.7 and 0.93. Retest evaluation was mostly done a few hours after the initial evaluation [34,35]. However, a re-evaluation only a few hours later may have a very high recall bias. Lassere et al. [13] reported an SDD for pain of 27 mm for a 1-day test-retest interval and 49 mm for a 7-day test-retest interval; the SDD for PGA was 37 mm for a 7-day interval. The ICC for all test-retest intervals and for PGA and pain (which was tested in 24 patients for the 1-day interval and 26 patients for the 7-day interval) was 0.75.
All former studies tested specific intervals, thus reporting point estimates for true change and showing no spread. The strength of our study is that we have multiple evaluations of the same interval, assuming that a 1-day interval between the first and the second day contains the same inherent error as, for example, the interval from the eighth to the ninth day. It can be seen in Table 3 that SDDs of the same time interval show a spread, which we then summarized to one SDD. As a second point, patients were on stable treatment, and no interventions coincided with any study visit, supported by the fact that no change in PROs could be seen (assessed via standardized response means), which is the foundation of a proper evaluation of reliability. Another principle of reliability studies is to use a time interval that is neither too long nor too short. In some ways we have intentionally violated this rule, because we wanted to investigate whether this holds true, and different intervals indeed lead to different results [27]. A limitation in our study could be that there is a dependency in the data, when using day 2 twice for calculating two test-retest intervals, comparing it with day 1 on the one hand and with day 3 on the other. Patients can also get used to daily assessments, and the diary group in particular was not blinded to their previous scores, so that reinforcement over the 56-day period may have taken place [36,37]. Furthermore, co-morbidities might influence the reliability of or fluctuations in these VAS scores. For example, patients with secondary FM, OA or low back pain experience pain and limitations in daily life, and it can be difficult for the patient to differentiate symptoms from these as opposed to symptoms caused by RA. Although co-morbidities were not formally assessed in our study, their presence must be assumed, at least to some extent, in our study population as well [38].
An important aspect in interpreting the ICC is the variation in scores between individual patients. Overall, our patients were rather heterogeneous, resulting in higher ICCs [37]; that is, they represent a wide range of individuals with RA. We explored this aspect when we calculated the reliability measures for tertiles of baseline PROs. Lower SDDs were found for patients in the lowest subgroup (VAS at baseline ranging between 0 and 11 mm), compared with higher subgroups. The most interesting subgroup in this respect was patients with a PGA ≤10 mm, since this value constitutes the cut-off point for the Boolean-based ACR/EULAR remission definition [20]. Here, a 13- to 22-mm change set the threshold for true change in PGA. For the other patients, SDDs of 20–33 mm of change are needed, in line with our findings for the total cohort. As the patient global criterion seems to be an important limiting factor for fulfilment of remission criteria [39–42], in particular of the Boolean-based ACR/EULAR remission criteria, the nature of its variability is important. Thus, patients in remission on stable treatment who evaluate their PGA at 2 cm on one clinical visit might still be regarded as being in remission if no other deviation of disease activity is noticeable.
In conclusion, the results of this study suggest that, in stable RA patients, a 25-mm change on the VAS for pain or PGA and a 30-mm change for fatigue may identify true change; however, this is clearly dependent on (and can be refined based on) the starting measurement level. It is also apparent that in patients who are assessed less frequently, the evaluation of measurement differences as indicating changes is more difficult. After identifying what comprises a true change, in clinical practice, of course, the next important step will be to determine whether the change is clinically relevant. All this in fact speaks for a more global interpretation of disease activity encompassing both patient- and physician-based measures, which in their totality can be a good estimate of a patient's true disease activity.

TABLE 1
Baseline characteristics of total patient group and separately for patients assessed by telephone or by diary
Median (interquartile range), unless indicated otherwise.

TABLE 2
Summary statistics of intraclass correlation coefficients of PGA, pain and fatigue
Intraclass correlation coefficient (ICC) is calculated for all patients (total) and for patients divided into tertiles of the respective baseline values of the PRO (lowest, middle, highest), separately for 1-, 7-, 14- and 28-day test-retest intervals. PGA: patient global assessment; PRO: patient-reported outcome.

TABLE 3
Summary of baseline values of PROs and of the smallest detectable differences in PGA, pain and fatigue

TABLE 4
Summary statistics of smallest detectable differences in PGA calculated separately for patients with baseline PGA ≤10 mm and baseline PGA >10 mm
Smallest detectable differences (in mm) are calculated separately for 1-, 7-, 14- and 28-day test-retest intervals. PGA: patient global assessment.