Florian Naye, Simon Décary, Catherine Houle, Annie LeBlanc, Chad Cook, Michèle Dugas, Becky Skidmore, Yannick Tousignant-Laflamme, Six Externally Validated Prognostic Models Have Potential Clinical Value to Predict Patient Health Outcomes in the Rehabilitation of Musculoskeletal Conditions: A Systematic Review, Physical Therapy, Volume 103, Issue 5, May 2023, pzad021, https://doi.org/10.1093/ptj/pzad021
Abstract
The purpose of this systematic review was to identify and appraise externally validated prognostic models to predict a patient’s health outcomes relevant to physical rehabilitation of musculoskeletal (MSK) conditions.
We systematically reviewed 8 databases and reported our findings according to Preferred Reporting Items for Systematic Reviews and Meta-Analysis 2020. An information specialist designed a search strategy to identify externally validated prognostic models for MSK conditions. Paired reviewers independently screened the title, abstract, and full text and conducted data extraction. We extracted characteristics of included studies (eg, country and study design), prognostic models (eg, performance measures and type of model) and predicted clinical outcomes (eg, pain and disability). We assessed the risk of bias and concerns of applicability using the prediction model risk of bias assessment tool. We proposed and used a 5-step method to determine which prognostic models were clinically valuable.
We found 4896 citations, read 300 full-text articles, and included 46 papers (37 distinct models). Prognostic models were externally validated for spinal, upper limb, and lower limb conditions and for MSK trauma, injuries, and pain. All studies presented a high risk of bias. Half of the models showed low concerns for applicability. Reporting of calibration and discrimination performance measures was often lacking. We found 6 externally validated models with adequate measures, which could be deemed clinically valuable [ie, (1) STarT Back Screening Tool, (2) Wallis Occupational Rehabilitation RisK model, (3) Da Silva model, (4) PICKUP model, (5) Schellingerhout rule, and (6) Keene model]. Despite having a high risk of bias, which is mostly explained by the very conservative properties of the PROBAST tool, the 6 models remain clinically relevant.
We found 6 externally validated prognostic models developed to predict patients’ health outcomes that were clinically relevant to the physical rehabilitation of MSK conditions.
Our results provide clinicians with externally validated prognostic models to help them better predict patients’ clinical outcomes and facilitate personalized treatment plans. Incorporating clinically valuable prognostic models could inherently improve the value of care provided by physical therapists.
Introduction
Affecting 1.71 billion people worldwide in 2019, musculoskeletal (MSK) conditions are the most prevalent type of condition requiring rehabilitation.1 Evidence from meta-analyses on MSK conditions reveals that most interventions in rehabilitation have small to moderate effects.2–9 Stratified medicine has been touted as a promising avenue to improve clinical outcomes following rehabilitation and to provide better personalized care.10 Stratified medicine refers to splitting heterogeneous and oversimplified label conditions into homogeneous subgroups that share similar biological or risk characteristics.10 Prognosis-related findings represent a fundamental component of stratified medicine.11
Prognosis refers to the risk of future health outcomes or treatment response in people with a health condition.12 Prognosis goes beyond diagnosis, as it predicts the patient’s trajectory and outcomes, either poor or positive.12–14 Prognosis-related findings can be used to determine the person’s specific treatment needs.14, 15 Research involving prognosis investigates and determines specific factors, from biological, psychological, and social components, which are associated with a defined outcome trajectory.14, 15 The utilization of prognostic factors in isolation usually results in poor prediction and may lead to inappropriate interventions.15, 16 To improve the predictions in terms of outcome or treatment allocation, it is essential to use multiple prognostic factors combined within a prognostic model.16 Accordingly, based on the PROGRESS Framework, treatment allocation may be best informed by prognosis research involving prognostic models.16
A prognostic model is a formal combination of multiple prognostic factors allowing to estimate an individual risk at a specific endpoint.16 In clinical settings, these prognostic models are very useful.17 In addition to refining the prediction, the modifiable factors presented in the models represent therapeutic targets to be considered in the care plan.13 This clinical value has led to the development of prognostic tools in rehabilitation, such as clinical prediction rules (ie, rules that estimate the probability of future outcomes) and clinical decision rules (ie, also called prescriptive clinical prediction rules, because they suggest a course of action).18–20 However, most clinical prediction rules were developed with low methodological quality resulting in poor predictive ability.18–20 Standards on the methodology in the development of these models have only recently been proposed14, 21 and 3 basic steps must be followed to develop a prognostic model: (1) model development, (2) model validation (internal and external), and (3) clinical impact assessment.14, 16
Model validation can be determined through the process of internal and external validation.16 External validation is the most relevant step to obtain a first indication of the clinical value of a prognostic model.16 External validation consists of testing the developed and “internally validated” model on a new sample of participants to determine its generalizability and performance.16, 22 Two important predictive performance measures in the external validation step are discrimination (accuracy) and calibration (reliability).14, 23–25 Discrimination refers to the model’s ability to correctly distinguish between the absence and presence of the outcome.24, 25 Calibration is the agreement between predicted probabilities of occurrence and observed proportions of the outcome.24, 25 Contrary to discrimination, which cannot be improved, an external validation step can lead to an updated version of the model, allowing its recalibration.22, 24 Despite the importance of the external validation, this step is very often overlooked in most developed prognostic models.14, 18, 19
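The 2 performance measures described above can be illustrated with a short sketch. This is a hypothetical example using simulated data and scikit-learn; the simulated cohort and all variable names are assumptions for illustration, not data from any included study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical external-validation sample: predicted probabilities from a
# previously developed model and the observed binary outcomes (simulated
# here so that the outcomes are consistent with the predictions).
rng = np.random.default_rng(0)
predicted = rng.uniform(0.05, 0.95, size=500)
observed = rng.binomial(1, predicted)

# Discrimination: can the model rank people with the outcome above people
# without it? (AUC, also called the C-statistic.)
auc = roc_auc_score(observed, predicted)

# Calibration: regress the observed outcome on the log-odds of the
# predictions. A slope near 1 and an intercept near 0 indicate good
# calibration (a large C disables regularization, approximating an
# unpenalized fit).
log_odds = np.log(predicted / (1 - predicted)).reshape(-1, 1)
fit = LogisticRegression(C=1e6).fit(log_odds, observed)
slope = fit.coef_[0][0]
intercept = fit.intercept_[0]

print(f"AUC = {auc:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Because the simulated outcomes were generated from the predicted probabilities, the fitted slope and intercept land near their ideal values of 1 and 0; on a real external-validation sample they typically drift, which is what recalibration corrects.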
As prognosis can serve as a way to personalize the care of MSK patients,11 it is essential to provide a critical synthesis of the externally validated prognostic models to help clinicians incorporate high-quality prognostic data into their MSK practice. The main objective of this review was to identify and appraise externally validated prognostic models that aim to predict a patient’s health outcomes that are relevant to physical rehabilitation of MSK conditions. Additionally, we aimed to identify and describe externally validated prognostic models with the greatest value to physical rehabilitation clinicians.
Methods
We conducted a systematic review following the JBI guidelines26 and reported our findings according to the Preferred Reporting Items for Systematic Reviews and Meta-Analysis 2020 guidelines.27 This systematic review was registered with the PROSPERO database: CRD42020181959.
Search Strategy
Using an iterative process, the search strategies were developed and tested by an experienced medical information specialist in consultation with the review team. The MEDLINE strategy was peer reviewed by another senior information specialist, prior to execution, using the PRESS Checklist.28 Strategies utilized a combination of controlled vocabulary (eg, “prognosis,” “models, statistical,” and “physical therapy modalities”) and keywords (eg, “prediction tool,” “rehabilitation,” and “validity”). Vocabulary and syntax were adjusted across databases. There were no date restrictions on any of the searches but, when possible, animal-only records were removed from the results. Specific details regarding the strategies appear in Supplementary Appendix 1.
Information Sources
The systematic search was undertaken using multiple sources, including the following OVID databases: Ovid MEDLINE®, including Epub Ahead of Print and In-Process & Other Non-Indexed Citations, Embase Classic, Embase, PsycINFO, and the Cochrane Library databases included in EBM Reviews. We also searched CINAHL (Ebsco platform), Web of Science, and PEDro. All searches were performed on January 27, 2022.
Eligibility Criteria
For the title and abstract screening phase, the potential studies had to meet 5 criteria:
1. Participants: Adults or children who live with a MSK condition affecting physical functioning and requiring physical rehabilitation (ie, “a set of interventions designed to optimize functioning and reduce disability in individuals with health conditions in interaction with their environment”).29
2. Intervention: Prognostic models for physical rehabilitation practice.
3. Outcomes: Prognostic models that predict outcomes relevant to a patient’s health, as defined by the International Classification of Functioning, Disability and Health: Body function impairments, activity limitations, and/or participation restrictions.
4. Published either in English or in French.
5. Reported a study design to validate the prognostic model.
For the full-text screening phase, we applied one more criterion:
1. Design: The study had to pertain to the external validation phase.
Selection Process
After removing duplicates, 6 independent reviewers (AB, AP, CH, FN, MD, SD, and YTL) screened the study titles and abstracts. A calibration exercise on the first 25 citations was performed by the reviewers and if inter-rater agreement (κ statistic) was below k = 0.60 or if many selected references were irrelevant to the research question, eligibility criteria were clarified.30 Potential studies were full text screened by 3 independent reviewers (AP, CH, and FN) following another calibration exercise on the first 25 citations.30 In case of disagreement during the screening, 5 reviewers (AP, CH, FN, SD, and YTL) reached a consensus.
Data Collection Process
Two independent reviewers (CH and FN) extracted the data from the retained studies. In case of disagreement, a 3rd evaluator (SD or YTL) was brought in to reach consensus. We used the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies checklist for prognostic model studies31 and the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) reporting guidelines32 to guide data extraction based on likely variables reported in external validation studies. The 2 evaluators reviewed the data extraction form after having extracted the first 10 studies to ensure relevance and to standardize the extraction.
Data Items
We extracted the following study characteristics: Authors, year of publication, country where the study was conducted, study design, settings in which prognostic models were validated (primary care, inpatient, and community), sample size and patients’ characteristics, and the MSK conditions requiring physical rehabilitation (low back pain, acute ankle sprain, etc).
We classified the study design as a prospective cohort design, randomized interventional trial, and retrospective design. If the data used in the study were not collected for the purpose of external validation of a prognostic model (principal objective of the study), we decided to categorize it as a retrospective design. We made this choice because of the inherent limitations of this design.33
From these prognostic models, we extracted data that included: Predictors, type of prognostic model assessed (ie, clinical prediction rule,14 clinical decision rule,14 regression formula,17 and other17), name of the model, intervals of follow-up (endpoint), performance measures (ie, calibration, discrimination, sensitivity, specificity, and likelihood ratios),23, 25 other relevant performance measures (ie, R², Brier score, and net benefit),34 update of the model, and methods used for external validation (ie, temporal, geographical, domain/setting, and other).22, 32 We further extracted the outcome label (name) and measurement tool(s) as predicted clinical outcomes.
Study Risk of Bias and Applicability Assessment
We conducted risk of bias assessment using the Prediction model Risk of Bias Assessment Tool (PROBAST)21 as advised by the Cochrane Prognosis Methods Group for prognostic reviews. The PROBAST examines 4 domains: Participants, predictors, outcome, and analysis and explores risk of biases as well as applicability (ie, “Concern that the included participants and setting do not match the review question”21). Two independent evaluators (CH and FN) completed the assessment. In case of disagreement, a 3rd evaluator (SD or YTL) was involved in reaching consensus. The risk of bias of each study was rated as low, high, or unclear according to the 4 domains of the PROBAST and their applicability to the review question was rated as low, high, or unclear concern. Because the inter-rater reliability and agreement are not stable within the PROBAST domains,35 we performed a calibration training with 5 studies. We applied the 2 following criteria: (1) Inter-rater reliability (κ ≥ 0.6)30 and (2) inter-rater agreement (≥80%).36 In case of uncertainty or if a criterion was not met, a consensus meeting between evaluators was performed to render the PROBAST items more explicit.
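The calibration thresholds above (κ ≥ 0.6 and ≥80% raw agreement) are simple to compute. The sketch below is a hypothetical illustration with made-up reviewer ratings, not the review’s actual PROBAST data:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical calibration exercise: 2 reviewers rate the same 10 studies
# as low ("L"), high ("H"), or unclear ("U") risk of bias.
reviewer_1 = ["L", "H", "H", "L", "U", "H", "L", "L", "H", "U"]
reviewer_2 = ["L", "H", "H", "L", "H", "H", "L", "U", "H", "U"]

# Chance-corrected inter-rater reliability (Cohen kappa).
kappa = cohen_kappa_score(reviewer_1, reviewer_2)

# Raw inter-rater agreement: proportion of identical ratings.
agreement = sum(a == b for a, b in zip(reviewer_1, reviewer_2)) / len(reviewer_1)

# The review's criteria: kappa >= 0.6 and agreement >= 80%; otherwise a
# consensus meeting is held to make the PROBAST items more explicit.
needs_consensus_meeting = kappa < 0.6 or agreement < 0.80
print(f"kappa = {kappa:.2f}, agreement = {agreement:.0%}")
```

In this made-up example the reviewers disagree on 2 of 10 studies, giving 80% agreement and κ ≈ 0.69, so both criteria are met and no consensus meeting would be triggered.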
Synthesis Methods
We conducted a narrative synthesis centered on the description of model characteristics to guide clinicians in their decision to use these prognostic models according to their practice context. Extracted data from prognostic model studies are not easy to interpret clinically. The PROBAST is a tool to assess the risk of bias and applicability but is of limited value for selecting impactful tools for clinical practice. The lack of a method or standards (benchmarks) for interpreting performance measures also makes it difficult for clinicians to determine which prognostic models should be used. To address this limitation, we proposed a structured, 5-step method to determine which prognostic models were deemed clinically valuable to physical rehabilitation clinicians. Based on current literature on prognostic model development, we identified 5 criteria that a model must fulfill to show clinical value. The model must show:
1. Low concerns about applicability.
2. Complete report of performance measures.
3. Calibration performance measure must be acceptable or good (Hosmer-Lemeshow test: P > .05,38 calibration slope = 1,39 calibration intercept = 0,39 and/or calibration plot considered as good or acceptable39).
4. Discrimination performance measure must be between 0.61 and 0.75 to be possibly helpful and >0.75 to be clearly helpful for clinicians23 and
5. Risks of bias must have only minor impacts on the model.35 For example, an inappropriate data source is more detrimental than predictors that are part of the outcome definition.
The predictive validity measures were not considered in this decision rule because they add little information on the discriminatory ability of models.40, 41
The models that met these 5 criteria are presented in the Results section.
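As a rough sketch, the 5-step screen can be expressed as a sequential filter. The dictionary keys below are illustrative assumptions, not fields from the review’s actual extraction form:

```python
def clinically_valuable(model):
    """Sketch of the review's 5-step screen for clinical value.

    `model` is a hypothetical dict summarizing one externally validated
    prognostic model; every key name here is illustrative.
    """
    if model["applicability_concern"] != "low":        # step 1: low applicability concerns
        return False
    if not model["performance_fully_reported"]:        # step 2: complete performance report
        return False
    if not model["calibration_acceptable"]:            # step 3: acceptable/good calibration
        return False
    if model["discrimination"] < 0.61:                 # step 4: >=0.61 possibly helpful,
        return False                                   #         >0.75 clearly helpful
    if model["bias_impact"] == "major":                # step 5: only minor risk-of-bias impact
        return False
    return True

# Illustrative model summary (made-up values).
example = {
    "applicability_concern": "low",
    "performance_fully_reported": True,
    "calibration_acceptable": True,
    "discrimination": 0.73,
    "bias_impact": "minor",
}
print(clinically_valuable(example))  # prints True
```

Note that the steps are applied in order, mirroring how the review’s flow narrows the candidate set at each stage rather than scoring all criteria at once.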
Role of Funding Source
The funders played no role in the design, conduct, or reporting of this study.
Results
Study Selection
We retrieved 4886 citations from our systematic search. Ten more articles were included from other sources (handsearching). Of the 4896 citations, 300 were retained for the full-text screening phase. We found 46 eligible studies, reporting on 37 models for spinal conditions, extremity conditions, and MSK injuries and trauma, that met all the inclusion criteria; these were selected for data extraction and for risk of bias and applicability assessment (see the Figure for the flow diagram).

Study Characteristics
Less than half (n = 20, 43.5%) of the retained studies were published before the TRIPOD reporting guidelines (2015), and 67.4% (31/46) of the studies were published before the PROBAST (2019). The data sources included prospective designs (n = 19, 41.3%), retrospective designs (n = 15, 32.6%), randomized controlled trials (n = 11, 23.9%), and mixed designs (RCT and prospective) (n = 1, 2.2%). Temporal validation (n = 24, 52.1%) and geographical validation (n = 17, 37%) were the 2 main methods of external validation. The sample size in each study ranged from 91 to 28,919 participants, and mean age ranged from 34 to 75 years. We found 17 (37%) studies pertaining to spinal conditions, 19 (41.3%) studies pertaining to upper or lower extremity conditions, and 10 (21.7%) studies for MSK trauma, injuries, and pain (undefined region). The United States of America (n = 13) and Australia (n = 7) were the most represented countries. All extracted characteristics are available in the Table.
| Characteristic | n (%) |
|---|---|
| **Year of publication** | |
| <2015 | 20 (43.5%) |
| Between 2015 and 2018 | 11 (23.9%) |
| ≥2019 | 15 (32.6%) |
| **Country** | |
| United States of America | 13 |
| Australia | 7 |
| Sweden | 5 |
| Denmark, Netherlands, United Kingdom (each) | 3 |
| Canada, Singapore, Switzerland, Multicentric (each) | 2 |
| France, Germany, Norway, Spain (each) | 1 |
| **Settings** | |
| Primary care | 25 (54.3%) |
| Inpatient | 13 (28.3%) |
| Community | 3 (6.5%) |
| Two or more settings | 4 (8.7%) |
| Unclear | 1 (2.2%) |
| **Study design** | |
| Prospective cohort design | 19 (41.3%) |
| Retrospective design | 15 (32.6%) |
| Randomized interventional design | 11 (23.9%) |
| Randomized interventional and prospective cohort design | 1 (2.2%) |
| **Methods of external validation** | |
| Temporal validation | 24 (52.1%) |
| Geographical validation | 17 (37%) |
| Domain/setting validation | 2 (4.3%) |
| Other | 1 (2.2%) |
| Unclear | 1 (2.2%) |
| Mixed method | 1 (2.2%) |
| **Model update** | |
| Yes | 4 (8.7%) |
| No | 42 (91.3%) |
Risk of Bias and Applicability in Studies
The overall judgment of risk of bias was high for all studies, with 97.8% of studies presenting a high risk of bias in the analysis domain. The overall judgment of concerns about applicability was low for two-thirds of the studies (69.6%). Supplementary Table 1 presents the risk of bias and applicability assessment for each study according to the PROBAST.
Results of Individual Studies
Overall Information of Selected Prognostic Models
We found 46 studies reporting on 37 unique models. Nineteen studies (41.3%) reported information on calibration through 6 different methods (calibration slope and intercept, Hosmer-Lemeshow test, calibration plot, expected/observed ratio, comparison with previous data, and scatter plot). Twenty-eight studies (60.9%) reported information on discrimination through one method (the area under the curve [AUC], also known as the C-statistic or C-index). The endpoint ranged from 4 days to 10 years, with the main endpoints being at 3 months (n = 7), 6 months (n = 9), and 12 months (n = 6).
External Validation of Prognostic Models for Spinal Conditions
We found 17 studies, of which 11 were for low back conditions and 6 focused on neck conditions. For neck pain, the Sterling clinical prediction rule (n = 2/6),42, 42 Schellingerhout clinical prediction rule (n = 2/6),42, 43 and the Ritchie clinical prediction rule (n = 2/6)42, 44 were the most studied models. The sample size for each study ranged from 101 to 1193 participants, and disability was the main outcome (n = 3/6). For low back pain, the STarT Back Screening Tool (n = 4/11)45–48 and the Flynn clinical decision rule (n = 2/11)49, 50 were the most studied models. The sample size for each study ranged from 105 to 1528 participants, and disability was the main outcome (n = 3/11). Supplementary Table 2 presents all the extracted characteristics of studies on spine conditions.
External Validation of Prognostic Models for Extremity Conditions
We found 19 studies, of which 15 were on lower extremity and 4 studies on upper extremity conditions. For the lower extremity, the main studied model was the Risk Assessment and Prediction Tool (n = 3/15).51–53 The sample size for each study ranged from 52 to 2863 participants. Discharge destination was the main outcome (n = 6/15). For the upper extremity, the sample size for each study ranged from 120 to 3637. Supplementary Table 3 presents all the extracted characteristics of studies on extremity conditions.
External Validation of Prognostic Models for Musculoskeletal Trauma, Injuries and Pain (Undefined Region)
We found 9 studies, of which 3 were on orthopedic trauma, 3 studies on MSK pain, and 3 studies on MSK injuries. The Wallis Occupational Rehabilitation RisK (WORRK) model (n = 2/9)55, 56 and the Örebro MSK Pain Questionnaire (n = 2/9)57, 58 were the most studied. The sample size for each study ranged from 107 to 28,919 participants. Return-to-work was the main outcome (n = 4/9). Supplementary Table 4 presents all the extracted characteristics of studies on MSK conditions in general.
Clinically Valuable Prognostic Models
We applied our 5-step process (described in the Synthesis Methods section) to the 46 retained studies. We observed that 16/46 showed high concerns about applicability, 17/30 showed no or incomplete reporting of performance measures, and 6/13 showed poor calibration. Of the 7 studies that showed possibly helpful discrimination, 1 showed risk of bias with major impact, leaving 6 studies with possibly helpful discrimination measures (Suppl. Material S1).
The 6 models with the greatest clinical value identified were (for an interactive version of this result, with additional information and resources, please visit https://view.genial.ly/62190374dcdc9300111d6d28/interactive-content-prognostic-models-in-musculoskeletal-rehabilitation or see Suppl. Material S2):
1. Forsbrand et al for their prediction of health-related quality of life (discrimination = 0.73) and work ability (discrimination = 0.68) at an endpoint between 11 and 27 months for people with acute/subacute low back or neck pain.46
2. Luthi et al for their prediction of return-to-work (discrimination = 0.73) at 24 months for people with orthopedic trauma.54
3. Da Silva et al for their prediction of number of days to pain recovery (discrimination = 0.71) at 1 month for people with acute low back pain.58
4. Traeger et al for their prediction of chronicity (discrimination = 0.66) at 3 months for people with low back pain.59
5. Schellingerhout et al for their prediction of global perceived recovery (discrimination = 0.66) at 6 months for people with neck pain.43
6. Keene et al for their prediction of poor outcome (ie, severe persistent pain and/or severe functional difficulty and/or significant lack of confidence in the ankle and/or recurrent sprain) (discrimination = 0.64) at 9 months for people with acute ankle sprain.60
Discussion
The objective of this study was to identify and appraise externally validated prognostic models that aim to predict a patient’s health outcomes that are relevant to physical rehabilitation of MSK conditions. We found 46 studies reporting on 37 unique models for spinal conditions, extremity conditions, and nonspecific MSK trauma, injuries, and pain. Although the risks of bias were high, 6 models were deemed clinically valuable. These findings led us to 3 main considerations.
There are few prognostic models that are highly clinically valuable. To be considered clinically valuable, a model must at least show “adequate” calibration and discrimination performance measures at the end of the external validation phase.16, 23, 61 A more conservative approach would be to add a clinical applicability criterion, which we derived from the results of the PROBAST applicability assessment,21 as well as risk of bias criteria. All models that presented adequate performance measures (calibration/discrimination) and low concern about applicability were considered at high risk of bias according to the PROBAST. However, we must take into consideration that the structure of the PROBAST is extremely conservative.35 For example, if 1 of the 18 items is judged as “absent,” the overall judgment is automatically considered at “high risk of bias.”21 Nevertheless, certain items have a major impact on the performance measures (eg, a small dataset with an inadequate number of events per variable), whereas others have a minor impact (eg, the outcome is part of the predictors), an important distinction that the PROBAST does not take into consideration.35 We considered the elements that had a major impact on the risk of bias. In the absence of specific standards for using the PROBAST, we chose this route to determine which models that had a high risk of bias could still be clinically valuable.
Among the 6 models with greatest value, the study by Forsbrand et al reported calibration performance through the Hosmer-Lemeshow test without other information.46 This test should be complemented by a calibration plot or a table comparing predicted versus observed outcome frequencies to provide useful information on calibration performance.21 Clinicians must remain conscious of this calibration limitation when using the STarT Back Screening Tool to predict health-related quality of life and/or work ability at 6 months in people with low back pain and/or neck pain. Thus, clinicians should keep in mind that the predicted results from the STarT Back Screening Tool could slightly deviate from the observed outcomes at the endpoint.17
Even with our structured 5-step method, it remains complex for clinicians to determine whether a given model is suitable for their practice. Further consideration needs to be given to the population (ie, selection/recruitment of participants) and clinical settings in which the model was developed/validated. There is often heterogeneity, which makes comparison between models difficult.33, 62 Also, clinicians must acknowledge the specific eligibility criteria of each model to guide their decision to use it (or not) for a specific patient. Moreover, the predicted outcomes of the selected models were generally consistent with the rehabilitation scope and patients’ needs.63 Some models used “recovery” as a predicted outcome; however, the definition of recovery varies between these models. This inconsistency represents a major barrier for model comparison and transferability.33 Finally, clinical applicability is important to facilitate the integration of prognosis in clinical practice.
Of the 46 studies retained in this review, only a few showed potential value for clinical utilization. This is disappointing and highlights the difficulty of summarizing the trajectory of MSK conditions with few clinical variables. Indeed, prognostic models are designed to be pragmatic (ie, brief) tools that can be easily applicable in any clinical setting.16 For example, in the context of low back pain, experts in pain management have validated a diagnostic framework incorporating 51 modifiable factors that develop or maintain low back pain.64 Yet, it is not realistic to explore all potential modifiable and non-modifiable factors contributing to an individual’s trajectory via a simple and clinically usable prognostic model.65 To address this limitation, a stratified approach with prognostic models was introduced and showed promising impact.66–69
Two systematic reviews on prognostic models were recently published. Walsh et al focused on the identification of development and validation studies of clinical decision rules (CDR, ie, response to physical therapist interventions).20 For their part, Silva et al focused on the development and validation of prognostic models for acute low back pain.70 Our findings on CDR converge with Walsh et al’s conclusion that the current literature does not support the use of the externally validated CDR.20 However, Walsh et al reported a low overall risk of bias for validation studies, which is inconsistent with our review.20 This discrepancy is most likely explained by the use of different tools to assess risk of bias. Walsh et al used a non-prognosis risk of bias tool (ie, the Cochrane Effective Practice and Organization of Care group criteria),20 whereas we followed the Cochrane Prognosis Methods Group for prognostic reviews with the use of the PROBAST. Nevertheless, our findings on predominant high risk of bias are consistent with those of Silva et al, who used the PROBAST for methodological quality assessment.70 Silva et al also highlighted the low reporting of performance measures, which is an important limitation for determining potential clinical value.70 According to Silva et al’s review, the Da Silva prognostic model was found to be the most valuable one.70 This result is consistent with the findings of our structured 5-step method to determine potentially clinically valuable models. However, a discrepancy appears regarding the PICKUP model. With our 5-step method, we found that this model could be possibly helpful, whereas Silva et al did not.70 This could be explained by 2 different discrimination thresholds.
We decided to use discrimination thresholds based on clinical relevance,23 whereas Silva et al used higher thresholds based on mathematical considerations.37 As reported by Traeger et al, the PICKUP model showed higher discrimination performance than clinical judgment, which supports our conclusion on its possible helpfulness.59 We therefore believe that the discrimination threshold for the clinical value of a prognostic model should be determined by comparing the prognostic model with clinical judgment. Compared with previous literature, we operationalized a synthesis method for clinical value that allowed us to propose impactful models for clinicians.
Methodological and statistical concerns may limit the integration of these prognostic models into clinical practice. From the PROBAST, we can conclude that the methods used for the external validation of prognostic models in the MSK field carry high risks of bias. This is consistent with previous systematic reviews reporting the same limitation.20, 62 These methodological flaws can lead to over- or under-fitting of the included models.
The clinical integration of these prognostic models may also be limited by 5 main aspects:
1. The lack of information on calibration and discrimination measures. Of the 46 studies included in our review, only 19 reported calibration information and 28 reported discrimination information. Internal validation techniques that correct for overfitting and optimism are insufficient to preserve the accuracy of a model in new patients.22 It is essential that publications on the external validation of prognostic models include calibration and discrimination measures.23, 61, 71 These measures inform clinicians on the relevance of a model: if calibration is poor on a new sample, the model is not useful in its present form,25 and if discrimination is poor on the new sample, the model is not clinically useful.23, 24
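To make these two measures concrete, the following sketch (not from the review; the predicted risks and outcomes are hypothetical) computes the 2 basic external-validation checks discussed above: the c-statistic (discrimination) and calibration-in-the-large (the ratio of observed to expected events).

```python
# Illustrative sketch only: two basic checks when externally validating a
# prognostic model on a new sample. Risks and outcomes are hypothetical.

def c_statistic(risks, outcomes):
    """Discrimination: probability that a randomly chosen patient with the
    event received a higher predicted risk than one without (AUC); 0.5 is
    no better than chance, 1.0 is perfect ranking."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    nonevents = [r for r, y in zip(risks, outcomes) if y == 0]
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in nonevents)
    return wins / (len(events) * len(nonevents))

def calibration_in_the_large(risks, outcomes):
    """Calibration: observed events divided by expected (sum of predicted
    risks); 1.0 means predictions are well calibrated overall."""
    return sum(outcomes) / sum(risks)

# Hypothetical validation sample of 6 patients
risks = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
outcomes = [1, 1, 0, 1, 0, 0]
print(c_statistic(risks, outcomes))               # discrimination
print(calibration_in_the_large(risks, outcomes))  # calibration
```

A model can discriminate well yet be poorly calibrated (eg, systematically overestimating risk), which is why both measures must be reported.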
2. Few domains/settings were used for external validation. Of the 46 studies included in our review, only 2 used domain/setting validation as their method of external validation. Differences between the participants included in the internal and external validation steps are potentially greater in domain/setting validation than in temporal validation,22 and models with only temporal validation provide the weakest evidence of generalizability.22 As proposed by Jenkins et al, a continual process combining the different methods of external validation could inform model updating and improve the generalizability of a model.16, 22, 72
3. Small sample size. Numerous studies included in our review had a sample size that did not meet the accepted rule of thumb of at least 100 events and 100 non-events.73, 74 When this rule of thumb is not met, studies can report inaccurate estimates of predictive performance measures that can mislead clinicians.71, 75 Even when the rule is met, it is not tailored to the specific model and validation setting.76 Recent methodological work on sample-size calculation should therefore be integrated to improve the estimation of important performance measures.71, 76
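As a simple illustration of the rule of thumb (the 25% outcome prevalence below is hypothetical, and this sketch does not replace the model-specific sample-size methods cited above), the minimum sample size is driven by whichever of the two counts, events or non-events, is rarer:

```python
# Illustrative sketch only: smallest validation sample expected to satisfy
# the 100-events / 100-non-events rule of thumb at a given outcome prevalence.
import math

def min_sample_size(prevalence, min_events=100, min_nonevents=100):
    """Return the smallest n whose expected event and non-event counts
    both meet the rule of thumb."""
    n_for_events = math.ceil(min_events / prevalence)
    n_for_nonevents = math.ceil(min_nonevents / (1 - prevalence))
    return max(n_for_events, n_for_nonevents)

# Hypothetical example: 25% of patients develop persistent pain
print(min_sample_size(0.25))  # 400 patients needed to expect 100 events
```

Note how quickly the requirement grows for rare outcomes, which is one reason many of the included validation studies fell short.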
4. Predictors included in the outcome. In some studies included in our review, the predictors were part of the outcome definition. This can inflate the apparent association between predictor and outcome and lead to optimistic estimates of performance measures.21
5. Design. Of the 46 studies included in our review, only 19 used the gold standard design (ie, a prospective cohort designed specifically to externally validate a given prognostic model).21 Thus, inappropriate designs are still frequently used for the external validation of prognostic models.33 Randomized controlled trials (RCTs) often have more restrictive eligibility criteria, which lead to more homogeneous participants; this narrower case-mix tends to yield lower discriminative ability.21 The inclusion and exclusion of potential participants is also problematic in retrospective designs based on routine-care registries.77, 78
There is a clear need for standardization. The weaknesses observed in external validation methodology can be explained in part by the recent publication of methodological guidelines; fewer than half of the included studies were published before the TRIPOD statement. With the publication of guidelines and tools such as the PROBAST, there is now an opportunity to standardize and improve methods: the data extracted in this review revealed a lack of standardization that made a meta-analysis impossible, and the PROBAST offers a chance to bridge this gap. From our perspective, some key aspects could be added to the PROBAST tool to make it more comprehensive and usable. In the absence of precise cut-offs for calibration and discrimination,79 it can be difficult for clinicians to determine whether a model is informative; the R2 and/or Brier score could therefore provide easily interpretable information on the overall performance of models.34 Moreover, from a clinical perspective, a decision curve analysis allows the clinician to determine whether a model is likely to be useful for decision-making.79, 80 Information on the net benefit (the trade-off between benefit and harm) could thus help identify the best model for clinical integration.41 The development of computerized tools could also facilitate the use of models in clinical settings: after entering a patient's data, clinicians could directly obtain that patient's prognosis (see an example at http://myback.neura.edu.au/). Finally, the current external validation literature is complex and heterogeneous because of the many different terms used for the external validation step (eg, external validity, prognostic validity, predictive validity/ability/capacity, discriminative validity/ability). Most of these terms are relevant but not specific to the external validation step.
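The Brier score and net benefit mentioned above can both be computed directly from predicted risks and observed outcomes. The sketch below (not from the review; the risks, outcomes, and decision threshold are hypothetical) uses the standard definitions: the Brier score as the mean squared prediction error, and the Vickers-Elkin net benefit as true positives minus harm-weighted false positives per patient.

```python
# Illustrative sketch only: overall performance (Brier score) and clinical
# utility (net benefit at a decision threshold). All values hypothetical.

def brier_score(risks, outcomes):
    """Mean squared difference between predicted risk and observed outcome;
    0 is perfect, lower is better."""
    return sum((r - y) ** 2 for r, y in zip(risks, outcomes)) / len(risks)

def net_benefit(risks, outcomes, threshold):
    """Vickers-Elkin net benefit when treating every patient whose predicted
    risk reaches `threshold`: true positives minus false positives weighted
    by the odds of the threshold, per patient."""
    n = len(risks)
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 1)
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 0)
    return tp / n - (fp / n) * (threshold / (1 - threshold))

# Hypothetical validation sample of 6 patients, decision threshold of 50%
risks = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
outcomes = [1, 1, 0, 1, 0, 0]
print(brier_score(risks, outcomes))
print(net_benefit(risks, outcomes, 0.5))
```

Plotting net benefit across a range of thresholds, and comparing it with treat-all and treat-none strategies, yields the decision curve analysis discussed above.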
This review was designed to identify external validation studies based on search terms from the prognostic research literature and may therefore have missed citations from the rehabilitation literature that use terminology other than "external validation."
Recalibration or updating (the addition of new predictors) was uncommon among the prognostic models included in our review: only 4 studies (8.7%) performed an update of a model. Our results suggest that when the external validation of a model is poor, researchers are tempted to develop a new model rather than improve the existing one.22, 81 Yet adjusting a model to the characteristics of the validation sample should be part of the validation study,14, 72, 81 and such updating is likely to improve a model's generalizability.22 With this perspective in mind, Jenkins et al argue for the continual updating and monitoring of prognostic models to reflect constantly evolving knowledge and practices.72
Limitations
Our review has some limitations. The main one is a probable overestimation of the risk of bias due to the PROBAST: the structure of this tool, the expertise required to complete it, and its complicated guidelines can lead to rating errors.35 In addition, our decision rule to determine clinical value is based on criteria from the scientific literature and on the authors' opinion regarding the PROBAST items. This strategy is not without potential bias, and expert opinion could refine and validate the content of our proposal. Because the PROBAST is very recent (2019) and sets high standards for assessing the risk of bias in prognostic studies, we expected most of the included studies to be categorized as high risk of bias; considering the very conservative properties of the PROBAST, our findings are still impactful.35 Finally, heterogeneity in terminology could lead to indexing or interpretation errors affecting our search strategy and/or selection process. There appears to be no standardized terminology in the rehabilitation literature concerning prognostic models; such standardization will be essential to improve the prognostic literature and to eventually conduct meta-analyses of prognostic models in rehabilitation.
Conclusion
Our systematic review found 46 studies on 37 unique prognostic models spanning spinal, lower extremity, and upper extremity conditions, as well as MSK injuries, trauma, and pain (undefined region). Performance measures were not systematically reported in the included studies. According to the PROBAST, two-thirds of the studies presented low concerns about applicability, but only one study presented a low risk of bias. We developed and applied a structured 5-step method that identified 6 prognostic models that could be deemed clinically valuable: (1) STarT Back Screening Tool, (2) WORRK model, (3) Da Silva model, (4) PICKUP model, (5) Schellingerhout CPR, and (6) Keene model. Researchers should consider the PROBAST and the methodological issues reported in our review to propose high-quality external validation studies. We also recommend that researchers standardize the terminology used to report studies on the external validation of prognostic models.
Authors’ Contributions
Concept/idea/research design: F. Naye, Y. Tousignant-Laflamme, C. Houle, C. Cook, A. LeBlanc, S. Décary
Writing: F. Naye, Y. Tousignant-Laflamme, C. Houle, C. Cook, M. Dugas, B. Skidmore, A. LeBlanc, S. Décary
Data analysis: C. Houle, F. Naye, Y. Tousignant-Laflamme, M. Dugas, S. Décary
Project management: Y. Tousignant-Laflamme, S. Décary
Fund procurement: Y. Tousignant-Laflamme, S. Décary
Providing facilities/equipment: Y. Tousignant-Laflamme, S. Décary
Providing institutional liaisons: Y. Tousignant-Laflamme
Consultation (including review of manuscript before submitting): A. LeBlanc
Funding
This work was supported by the Ordre Professionnel de la Physiothérapie du Québec and the Strategy for Patient-Oriented Research.
Data Availability Statement
Data are available upon request.
Disclosures
The authors completed the ICMJE Form for Disclosure of Potential Conflicts of Interest and reported no conflict of interest.