Assessing fatigue in adults with axial spondyloarthritis: a systematic review of the quality and acceptability of patient-reported outcome measures

Abstract
Objective. The aim was to evaluate the quality and acceptability of patient-reported outcome measures used to assess fatigue in patients with axial spondyloarthritis.
Methods. A two-stage systematic review of major electronic databases (1980–2017) was carried out to: (i) identify measures; and (ii) identify evaluative studies. Study and measurement quality were evaluated following international standards. Measurement content was appraised against a conceptual model of RA-fatigue.
Results. From 387 reviewed abstracts, 23 articles provided evidence for nine fatigue-specific measures: 6 multi-item and 3 single-item. No axial spondyloarthritis-fatigue-specific measure was identified. Evidence of reliability was limited, but acceptable for the Multi-dimensional Fatigue Inventory (internal consistency, test–retest) and Short Form 36-item Health Survey Vitality subscale (SF-36 VT; internal consistency). Evidence of construct validity was moderate for the Functional Assessment of Chronic Illness Therapy-Fatigue and 10 cm visual analog scale, limited for the SF-36 VT and not available for the remaining measures. Responsiveness was rarely evaluated. Evidence of measurement error, content validity or structural validity was not identified. Most measures provide a limited reflection of fatigue; the most comprehensive were the Multi-dimensional Assessment of Fatigue, Multi-dimensional Fatigue Inventory-20, Functional Assessment of Chronic Illness Therapy-Fatigue and Fatigue Severity Scale.
Conclusion. The limited content and often poor quality of the reviewed measures limit any clear recommendation for fatigue assessment in this population; assessments should be applied with caution until further robust evidence is established. Well-developed, patient-derived measures can provide essential evidence of the patient’s perspective to inform clinical research and drive tailored health care. The collaborative engagement of key stakeholders must seek to ensure that future fatigue assessment is relevant, acceptable and of high quality.


Introduction
Pain, stiffness, reduced mobility and fatigue are cardinal features of axial spondyloarthritis (axSpA), including AS [1]. However, despite the importance afforded to fatigue by patients [2,3], fatigue severity was added to international assessment guidance for axSpA only in 2009 [4]. Accordingly, fatigue assessment in axSpA clinical trials increased significantly from a mere 17.1% of trials completed pre-2001 to 84% post-2001 [5], with most trials (84%) using the single fatigue-severity visual analog scale (VAS) recommended in the assessment guidance [6]. A recent conceptualization of fatigue in RA demonstrated the multifaceted and often complex relationships between disease-specific, cognitive/behavioural (behaviour, cognitive, emotion) and personal (support, health, environment, responsibilities) factors [7]; a complexity that might not be readily captured with a single item of severity [8]. Moreover, individuals experiencing significant impairment owing to frequent, but not severe (VAS scores <5), fatigue would not be identified if assessment were informed purely by fatigue severity [8]. Patients' fatigue experience may, therefore, be better captured with multi-item, multidomain patient-reported outcome measures (PROMs), providing a structured, patient-reported assessment of health [9,10]. These may be generic, containing items reflecting general health and completed by any population, or specific to a condition (e.g. axSpA), an aspect of health (e.g. fatigue) or a population (e.g. children). A scoping review of fatigue measures used in rheumatology listed >12 multi-item measures, but only one rheumatology-specific, multi-item measure [11], the Bristol RA Fatigue Multi-Dimensional Questionnaire [12,13]. However, the quality, acceptability and relevance of measures were not explored, thus limiting evidence-based recommendations.
This review will systematically appraise, compare and synthesize published evidence of the quality and acceptability of clearly defined single- and multi-item PROMs used in fatigue assessment in axSpA. The review will provide a transparent assessment of the evidence with which to inform PROM selection for future application in axSpA research and clinical practice.

Identification of studies and PROMs
Medical subject headings and free text searching reflected: (i) population: axSpA/AS; (ii) construct: fatigue; (iii) assessment type: PROMs; and (iv) measurement and practical properties [14]. Five databases were searched: Medline (OVID), Embase (OVID), PsycINFO (OVID), Cumulative Index of Nursing and Allied Health Literature and Web of Science; from January 1980 to August 2017. A second search used the names of identified measures: (i) population; (ii) construct; (iii) named measures; and (iv) measurement properties (supplementary Appendix S1). Reference lists of included studies and existing reviews were reviewed [11,15].

Eligibility criteria
One author (N.A.P.) assessed all titles and abstracts; agreement was independently checked on a 10% subset by a second author (K.L.H.). A third author (J.C.P.) double-assessed all abstracts relating to PsA. Any conflicts were resolved through discussion.

Study inclusion
Studies were included if they contained a clearly identifiable and reproducible patient-reported assessment of fatigue, reported evidence of development and/or evaluation after completion by axSpA patients, and were written in English. Studies were excluded if they were available only as abstracts, fatigue assessment was not patient reported, clearly identifiable or reproducible, or the study described PROM application only.

PROM inclusion
PROMs were included if they were fatigue specific, assessed fatigue as a separate domain within a multidomain measure, or were single or multi-item assessments. Clinician-reported assessments were excluded.

Data extraction and appraisal
Data extraction was informed by earlier published reviews [16][17][18][19], and the COnsensus-based Standards for the selection of health status Measurement INstruments (COSMIN) checklist [20][21][22]. Study and PROM-specific information was extracted. Evidence of measurement properties included: validity, reliability, responsiveness and interpretability (supplementary Appendix S2, available at Rheumatology Advances in Practice online). Practical properties included evidence of feasibility (administration time; scoring) and acceptability (patient relevance). Evidence of fatigue conceptualization and information pertaining to patient involvement was extracted and recorded. The RA-fatigue conceptual model [7] informed a comparative appraisal of PROM item content. One reviewer (N.A.P.) completed all data extraction. A 10% subset was independently double-extracted (K.L.H.) and agreement checked.

Assessment of study methodological quality
The COSMIN four-point checklist informed an assessment of study methodological quality for each reported measurement property: poor, fair, good or excellent [20][21][22]. The lowest item rating per measurement property informed the overall score.
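The lowest-rating rule described above can be illustrated with a minimal Python sketch. The function name and the example item ratings are hypothetical and serve only to show the logic; they are not taken from the COSMIN materials or from any reviewed study.

```python
# Ordinal ranking of the four COSMIN ratings, from worst to best.
COSMIN_ORDER = ["poor", "fair", "good", "excellent"]


def overall_cosmin_rating(item_ratings):
    """Return the lowest item rating recorded for one measurement property."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # min() with the ordinal position as key picks the worst rating;
    # an unrecognized rating raises ValueError via list.index().
    return min(item_ratings, key=COSMIN_ORDER.index)


# Hypothetical item ratings for one study's evaluation of internal consistency.
example_items = ["excellent", "good", "fair", "good"]
print(overall_cosmin_rating(example_items))  # fair
```

Under this rule a single weak checklist item caps the overall score for that property, however well the other items are rated.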

Assessment of PROM quality
A synthesis of recommendations described by others [18,19,23] facilitated the transparent appraisal of PROM quality. Measurement properties were appraised and rated accordingly: adequate (+); inadequate (−); conflicting (±) or unclear (?) (supplementary Appendix S2, available at Rheumatology Advances in Practice online).

Data synthesis and PROM recommendation
Four factors informed the synthesis: (i) study methodological quality (COSMIN); (ii) number of studies reporting evidence; (iii) ratings for measurement/practical properties per measure; and (iv) consistency of results between studies [16,18]. The final synthesis, hence the evidence upon which PROM recommendation will be made, reflects both: (i) the quality of each measurement property: adequate (+), not adequate (−), conflicting (±) or unclear (?) (supplementary Appendix S2, available at Rheumatology Advances in Practice online); and (ii) the overall level of evidence for each measurement property: 'strong: consistent findings in multiple studies of good methodological quality OR in one study of excellent quality', 'moderate: consistent findings in multiple studies of fair methodological quality OR in one study of good quality', 'limited: one study of fair methodological quality', 'conflicting: conflicting findings' or 'unknown: only studies of poor methodological quality' [18]. PROM recommendations will consider: (i) the extent to which key domains of fatigue identified in the RA-fatigue model are reflected in the PROM (content validity); (ii) whether there is adequate evidence, minimally, of measurement validity (structural and construct) and reliability (internal consistency and test-retest); and (iii) an evidence base that is judged, as a minimum, to be moderate.
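The level-of-evidence categories quoted above can be read as a small decision rule. The sketch below is one possible encoding in Python, under two stated assumptions that the text does not specify: studies of poor quality are set aside before consistency is judged, and mixed-quality cases (e.g. one good plus one fair study) take the highest level that any rule supports. The function name and return labels are hypothetical.

```python
def level_of_evidence(studies):
    """Classify the evidence for one measurement property of one PROM.

    studies: list of (quality, finding) tuples, where quality is one of
    'poor', 'fair', 'good', 'excellent' and finding is '+' or '-'.
    """
    if not studies:
        return "no evidence"
    # Assumption: poor-quality studies do not contribute to the synthesis.
    rated = [(q, f) for q, f in studies if q != "poor"]
    if not rated:
        return "unknown"  # only studies of poor methodological quality
    if len({f for _, f in rated}) > 1:
        return "conflicting"  # findings disagree between studies
    qualities = [q for q, _ in rated]
    n_good = qualities.count("good")
    if "excellent" in qualities or n_good >= 2:
        return "strong"
    if n_good == 1 or qualities.count("fair") >= 2:
        return "moderate"
    return "limited"  # a single study of fair quality


# Hypothetical example: two fair-quality studies with consistent findings.
print(level_of_evidence([("fair", "+"), ("fair", "+")]))  # moderate
```

A rule of this kind makes the synthesis reproducible, but borderline cases still call for reviewer judgement, so the sketch should be treated as illustrative only.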

Study and sample characteristics
All studies included adults with a primary diagnosis of axSpA, aged between 18 and 72 years old (supplementary Appendix S3, available at Rheumatology Advances in Practice online). Sample sizes ranged from 40 to 812. Studies were predominantly cross-sectional, investigating fatigue prevalence and/or its association with other variables.

Measurement properties and methodological quality
Study methodological quality (per PROM) was assessed and recorded (supplementary Appendix S4, available at Rheumatology Advances in Practice online). An evidence synthesis is presented in Table 2. Evidence of measurement error, content or structural validity, criterion-based responsiveness, acceptability or feasibility of completion was not identified.

Fatigue conceptualization and patient involvement
A review of PROM development suggests very limited conceptualization of fatigue for four PROMs (MFI-20, MFSI-SF, SF-36 and BFI; Table 1). Item generation or selection was often poorly reported and lacking in transparency. Only the single-item VAS of fatigue severity (taken from the BASDAI) was developed specifically for use with axSpA patients, but a conceptualization of fatigue was absent. The involvement of patients did not extend beyond participation (i.e. simply measurement completion); no study included patients as research partners in measurement evaluation.

Comparative item content
Although similarities of item content exist, all reviewed measures provided a limited reflection of the RA-fatigue model (Table 3). All single-item measures assessed fatigue severity.
Multidimensional fatigue-specific PROMs

MAF [24]
Six poor-quality studies provided limited evidence of construct validity (correlations and known-groups validity), including small to moderate associations between the MAF total and AS-specific Bath measures (range 0.23-0.73), and the MAF subscales and SF-36 VT (range 0.3-0.53) and 10 cm fatigue-severity VAS (range 0.39-0.53) [33][34][35][36][37][38]; all evaluations lacked a priori hypothesized associations.

MFI-20 [25]
One poor-quality study provided limited evidence of construct validity [39]. A fair-quality study provided acceptable evidence of internal consistency {Cronbach's α from 0.68 [Reduced Motivation (RM) subscale] to 0.86 [Reduced Activity (RA) subscale]} and construct validity {moderate to strong associations between subscales [general fatigue (GF) with physical fatigue (PF) 0.69/RA 0.52/RM 0.45/mental fatigue (MF) 0.45; MF with PF 0.40/RA 0.42/RM 0.48; RM with PF 0.51/RA 0.54], supporting a priori hypothesized associations} [40]. Limited evidence for 1-week test-retest reliability was also reported for patients after completion of a VAS on a person's overall perceived health, taken from the EuroQoL (EQ-5D) [intraclass correlation coefficient (ICC) range: 0.57 (PF) to 0.75 (RM/MF)] in a study judged to be of fair quality [39]; for three subscales (GF, PF and RA) values <0.70 were reported. Distribution-based measures of responsiveness [both effect size (ES) statistics and the standardized response mean (SRM)] were calculated from trial data, without any a priori hypotheses, following 3-month completion after the end of spa therapy: small values (<0.

Multi-dimensional Fatigue Symptom Inventory-Short Form (MFSI-SF) [26]
One poor-quality study provided limited evidence of construct validity [41]. Weak to strong associations between the MFSI-SF subscales and the BASDAI 10 cm VAS were reported (10 cm VAS with GF 0.71/PF 0.74/emotional fatigue 0.56/MF 0.45/Vigor −0.32) after completion by 62 AS patients. Although association between variables could be assumed, a priori hypothesized associations were not stated.

FSS [28]
Both strong (0.77) [43] and moderate (0.53) [44] associations between the FSS and the 10 cm fatigue-severity VAS have been reported in two studies judged to be of poor quality. Small ES were reported at 28 days for participants in both arms of a placebo-controlled trial of s.c. etanercept (ES 0.15/−0.23; SRM 0.22/0.22) [45].
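For readers less familiar with the distribution-based responsiveness statistics reported above, the following Python sketch shows how the ES and SRM are commonly computed from paired scores. It assumes the conventional definitions (ES: mean change divided by the SD of baseline scores; SRM: mean change divided by the SD of change scores); the reviewed studies may have used variants. The scores in the example are invented for illustration.

```python
from statistics import mean, stdev


def change_scores(baseline, follow_up):
    """Paired differences (follow-up minus baseline) for each patient."""
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow-up must be paired")
    return [f - b for b, f in zip(baseline, follow_up)]


def effect_size(baseline, follow_up):
    """ES: mean change divided by the SD of the baseline scores."""
    return mean(change_scores(baseline, follow_up)) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """SRM: mean change divided by the SD of the change scores."""
    change = change_scores(baseline, follow_up)
    return mean(change) / stdev(change)


# Hypothetical fatigue VAS scores (0-10) for five patients.
baseline = [6, 7, 5, 8, 6]
follow_up = [5, 5, 4, 6, 6]
print(round(effect_size(baseline, follow_up), 2))
print(round(standardized_response_mean(baseline, follow_up), 2))
```

Both statistics describe the size of change relative to score variability; neither indicates whether the change is meaningful to patients.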

SF-36 vitality subscale [29]
One fair-quality study provided acceptable evidence of construct validity [42]: a strong association between the VT subscale and the FACIT-fatigue was reported (r = 0.74; r = 0.82), a moderate association with the 10 cm VAS (r = −0.49) and a weak association with the BASFI (r = −0.33). One good-quality study provided acceptable evidence of internal consistency and item-level performance (Cronbach's α 0.78/0.88; item-total correlation 0.57/0.64) [42].

Single-item fatigue PROMs

10 cm fatigue-severity VAS [31]
One good-quality study provided acceptable evidence of construct validity [42]. A strong association between the item and the FACIT-fatigue (r = −0.69), and a moderate association with the SF-36 VT (r = −0.49) were reported after completion by AS patients participating in a double-blind, placebo-controlled clinical trial, supporting a priori hypothesized associations. Test-retest reliability (ICC = 0.60) was reported after a 6-week test-retest period in patients defined as stable on the EuroQoL EQ-VAS (general health), in a study judged to be of fair quality [40]; this estimate was below accepted thresholds for use with groups (0.70) or individuals (0.90) [47]. In comparison with participants who received placebo or NSAIDs (small ES −0.35) [46], large ES statistics (ES = 0.89; SRM = 0.89; Guyatt statistic 0.92) were reported at 6 weeks for participants receiving the active, spa therapy intervention [40].
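The reliability thresholds cited above (0.70 for group-level use, 0.90 for individual-level use) can be expressed as a simple check. In the Python sketch below the function name and labels are hypothetical, and treating a coefficient exactly equal to a threshold as meeting it is an assumption.

```python
GROUP_THRESHOLD = 0.70       # minimum for comparisons between groups
INDIVIDUAL_THRESHOLD = 0.90  # minimum for decisions about individual patients


def reliability_use(coefficient):
    """Map a reliability coefficient (e.g. an ICC) to its supported level of use."""
    # Assumption: thresholds are inclusive.
    if coefficient >= INDIVIDUAL_THRESHOLD:
        return "groups and individuals"
    if coefficient >= GROUP_THRESHOLD:
        return "groups only"
    return "below accepted thresholds"


# The ICC of 0.60 reported for the 10 cm fatigue-severity VAS.
print(reliability_use(0.60))  # below accepted thresholds
```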

Modified 10 cm VAS [32]
The 10 cm fatigue-severity VAS descriptor 'none' was modified to 'no problem', changing the response scale.

BFI-WF-NRS [30]
One qualitative study explored the relevance and acceptability of the WF-NRS single item taken from the BFI [48]. Although the item was judged to be relevant, the phraseology was confusing ('what best describes your worst fatigue'). A longer recall period than 24 h was also recommended, to express fatigue variability better.

Discussion
Greater understanding of the impact of fatigue has been identified as a priority by axSpA patients [3]. However, current assessment guidance is limited to a single-item measure of fatigue severity [4], which underestimates the often profound and wide-ranging impact of fatigue on an individual's life. Of the nine reviewed measures, only three were multidimensional, containing items reflecting different aspects of fatigue. However, no measure was specific to the experience of axSpA-fatigue and none had been evaluated for its relevance to axSpA patients. There was limited and often poor-quality evidence of reliability and construct validity, and an absence of interpretative guidance and evidence of measurement error, content validity or structural validity for any of the reviewed measures. Evidence of responsiveness was limited to the reporting of effect size statistics, which fail to provide an accurate evaluation of the ability of a measure to detect meaningful change in health [19]. Consequently, the lack of minimal measurement evidence for validity and reliability means that it is not possible to make any assessment recommendations.
This is the first review of the quality and acceptability of measures of fatigue after completion by patients with axSpA. The results are strengthened by an evaluation of both study [20,21] and PROM quality [16,18,19,23], paired with a detailed comparative appraisal of item content. However, much of the extracted data came from studies where PROM evaluation was not the primary focus of the study. As such, the rigour of the COSMIN criteria meant that these studies typically scored poorly. Although a single reviewer (N.A.P.) assessed all titles and abstracts for review eligibility, a subset of titles and abstracts was reviewed by a second reviewer (K.L.H.) and reliability was checked.
Adoption of the RA-fatigue conceptual model in the present review highlighted the limited content validity of the reviewed measures. No PROM fully reflects the RA model of fatigue. Both the MAF and the FSS include the assessment of fatigue frequency and severity, two important components of the fatigue experience for axSpA patients [8]. However, only two PROMs {the MFI-20 [10/20 (total) items] and FACIT-fatigue (6/13 items)} include items that seek to assess the cognitive/behavioural (and emotional) impact of fatigue. Other PROMs (MAF, MFSI-SF and FSS) include items limited to only two of the cognitive/behavioural domains. Although adequate evidence of internal consistency and reliability was reported for the MFI-20, it is unclear whether the PROM can detect change, or if it measures components of fatigue important to axSpA patients. Acceptable, but limited, evidence of a strong association between the FACIT-fatigue and SF-36 VT enhances confidence in the ability of the FACIT-fatigue to measure fatigue in this population. However, evidence of measurement reliability and responsiveness is lacking in the axSpA population. Consequently, although demonstrating acceptable item content, both measures currently lack the evidence of essential psychometric properties needed to support their use in axSpA-fatigue assessment. A robust fatigue assessment is necessary to detect and detail the nuances of fatigue experience that are essential to providing individualized and tailored health care to axSpA patients.
Qualitative research has detailed a similar experience of fatigue in axSpA, highlighting the significant impact of fatigue on social life, patient mental health and relationships with others, their ability to engage with usual activities of daily living [49] and their reliance on self-management strategies [2]. This demonstrates the importance of considering these aspects in the assessment of fatigue impact, and the insufficient information available from using only a single-item VAS of fatigue severity [49]. Similarities between RA and axSpA-fatigue experience support the appropriateness of the RA-fatigue model as a framework against which PROM content and relevance can be appraised for use with the axSpA population [49,50]. However, growing evidence demonstrates that fatigue experience is a dynamic, complex and multifaceted experience that is, to a large extent, disease specific. For example, evidence has shown both similarities and differences in fatigue experience between related and unrelated conditions (FM, multiple sclerosis, AS and stroke) [51] and between different stages of illness, such as patients with active cancer compared with cancer survivors [52]. Therefore, although this review has used the RA-fatigue model to appraise PROM item content, it is essential that a conceptual model is developed to reflect the nuances specific to the experience and impact of axSpA-fatigue.
A review of the quality of fatigue measures used in a range of chronic illnesses also highlighted the lack of evidence of essential measurement properties, thus limiting recommendations [53]. However, the judgement of measurement quality lacked transparency, and study methodological quality was not determined. International guidance promotes the importance of greater transparency in the assessment of measurement quality and acceptability [19,23,54]. Adoption of the COSMIN checklist, as in the present review, facilitates the incorporation of study methodological quality in the final judgement of PROM quality [20][21][22].
Well-developed, patient-derived PROMs are both robust and relevant to the experience of patients, capturing the outcomes that really matter [10,55]. However, numerous legacy measures, where content was largely driven by the perspective of clinicians, may lack relevance to patients [10,55]. The failure of PROMs to capture the outcomes that really matter to patients [9,56–58] undermines the potential contribution to patient-centred care and shared decision-making, and was the driver for the co-development of a new, patient-derived measure of fatigue for RA, namely the Bristol RA Fatigue Multi-dimensional Questionnaire [12,13]. Of the nine PROMs identified in this review, only four provided a limited conceptualization of fatigue, which was mostly derived from literature reviews and clinical experts. Only one PROM (FACIT-fatigue) was developed following a qualitative method (semi-structured interviews) but did not provide a conceptualization of fatigue. Qualitative research offers greater insight into key health issues affecting patients, improving the relevance and acceptability of PROM content. This can highlight the unmet needs of patients, supporting targeted healthcare efforts to address what really matters to the patient.
The MFI-20 and FACIT-fatigue provide the most comprehensive assessment of fatigue [7], but evidence of their psychometric qualities in the axSpA population is limited.
A limited number of fatigue-specific PROMs have been evaluated for their quality and acceptability for use in axSpA fatigue assessment. However, recommendations are limited by the poor methodological quality of most studies coupled with the limited evidence of robust measurement or practical properties. These limitations also suggest that data generated from the application of these measures in routine practice or clinical research settings should be interpreted with caution. A comparative appraisal of PROM content suggests that the MFI and FACIT-fatigue provide the most comprehensive assessment of fatigue, including the impact on both cognition and behaviour. However, further exploration of the relevance and acceptability of the reviewed measures to patients with axSpA-fatigue is warranted. Moreover, comparative evaluations of those measures that have acceptable content validity are urgently required to establish robust evidence of essential measurement properties; specifically, reliability, validity, responsiveness and interpretation.

Supplementary data
Supplementary data are available at Rheumatology Advances in Practice online.