Evaluation of a Treatment-Based Classification Algorithm for Low Back Pain: A Cross-Sectional Study

Background. Several studies have investigated criteria for classifying patients with low back pain (LBP) into treatment-based subgroups. A comprehensive algorithm was created to translate these criteria into a clinical decision-making guide.
Objective. This study investigated the translation of the individual subgroup criteria into a comprehensive algorithm by studying the prevalence of patients meeting the criteria for each treatment subgroup and the reliability of the classification.
Design. This was a cross-sectional, observational study.
Methods. Two hundred fifty patients with acute or subacute LBP were recruited from the United States and Australia to participate in the study. Trained physical therapists performed standardized assessments on all participants. The researchers used these findings to classify participants into subgroups. Thirty-one participants were reassessed to determine interrater reliability of the algorithm decision.
Results. Based on individual subgroup criteria, 25.2% (95% confidence interval [CI]=19.8%–30.6%) of the participants did not meet the criteria for any subgroup, 49.6% (95% CI=43.4%–55.8%) of the participants met the criteria for only one subgroup, and 25.2% (95% CI=19.8%–30.6%) of the participants met the criteria for more than one subgroup. The most common combination of subgroups was manipulation + specific exercise (68.4% of the participants who met the criteria for 2 subgroups). Reliability of the algorithm decision was moderate (kappa=0.52, 95% CI=0.27–0.77, percentage of agreement=67%).
Limitations. Due to a relatively small patient sample, reliability estimates are somewhat imprecise.
Conclusions. These findings provide important clinical data to guide future research and revisions to the algorithm. The finding that 25% of the participants met the criteria for more than one subgroup has important implications for the sequencing of treatments in the algorithm. Likewise, the finding that 25% of the participants did not meet the criteria for any subgroup provides important information regarding potential revisions to the algorithm's bottom table (which guides unclear classifications). Reliability of the algorithm is sufficient for clinical use.

Despite the ever-increasing numbers of randomized controlled trials (RCTs) evaluating interventions for low back pain (LBP), 1 treatment outcomes remain less than optimal. 2 Unfortunately, the majority of RCTs testing the effectiveness of LBP treatments demonstrate negligible to small effects. 3-7 One hypothesis for these small effects is that by using broad inclusion criteria in trials, we are testing interventions in a heterogeneous population of patients with LBP (eg, those with nonspecific LBP) 8 and such a diverse group of patients may not all respond to the one treatment. 9 This hypothesis (that not all patients should be expected to respond to a single treatment) has led to a movement in research to identify subgroups of patients who may respond preferentially to certain treatments, in the hope that treatment effects will be larger in these subgroups.
Recent research has identified subgroups of patients who may respond best to 4 common physical therapy interventions for LBP: spinal manipulation, stabilization therapy (eg, motor control training), specific exercise (eg, direction-specific exercise), and traction. 10-14 Criteria for inclusion in 2 of the treatment subgroups (manipulation and stabilization) were identified by creating clinical prediction rules, 12,14 with the manipulation subgroup later tested in RCTs. 11,15 The criteria for the other 2 treatment subgroups (specific exercise and traction) were based on prevalent theories of clinical decision making, subsequently supported by RCTs. These studies limited the inclusion criteria to those patients believed to respond to the treatment, then compared groups who received and did not receive the treatment of interest. 10,13 For the remainder of this article, the criteria for the 4 subgroups identified by these studies will be termed individual subgroup criteria.
In order to translate the results of these individual studies 10,12-14 into a clinical reasoning strategy applicable to more routine clinical circumstances, a decision-making flowchart (termed the "classification algorithm") was developed (Figure). 16,17 The classification algorithm attempted to clarify 2 important considerations required for clinical utility. First, to be a clinically useful decision-making aid, a classification algorithm should place each patient into only one treatment category. Second, it should be comprehensive, able to categorize all patients for whom it is intended. If either of these requirements is not met, it will be unclear what treatment a therapist should provide, a result that is contrary to the purpose of an algorithm.
To meet these requirements, modifications to the individual study subgroup criteria were necessary. Because an individual patient with LBP may exhibit subgroup criteria from more than one category, a hierarchical order of clinical findings had to be proposed in the algorithm to result in a single treatment category for each patient. Conversely, because an individual patient may not clearly fit the subgroup criteria for any one treatment category, the algorithm had to provide additional criteria to help a therapist choose the "best" treatment subgroup for these patients. For the remainder of the article, the revised subgroup criteria used by the classification algorithm will be termed the comprehensive algorithm. Preliminary evidence exists to support the effectiveness of the comprehensive algorithm (and thus these modifications). 18 An RCT showed that patients who received "matched" treatment according to their classification based on the algorithm (3-treatment subgroup version) had significantly lower disability levels at 4 weeks and 1 year than those who received "unmatched" treatment. 18
By studying the translation of individual studies into the broader clinical decision-making algorithm, it may be possible to improve or refine the comprehensive algorithm. In creation of the comprehensive algorithm, this translation included the decisions made regarding modification of the individual subgroup criteria and development of the hierarchical structuring. The majority of these decisions were based upon anecdotal evidence and assumptions of what findings should take precedence over others, and so need to be tested. For example, although mutual exclusivity among classification categories is forced by the structure of the comprehensive algorithm, the degree to which treatment subgroups identified by the comprehensive algorithm actually overlap based on individual subgroup criteria is currently unclear.
There are case reports in the literature of patients who have met the individual subgroup criteria of more than one treatment subgroup, 19,20 but there has been no systematic attempt to quantify the nature or degree of overlap among categories in a clinically representative sample. Data are needed to determine the extent of overlap among subgroups and establish in which subgroups overlap is most common using the individual subgroup criteria that informed the comprehensive algorithm. In addition, it is unclear how many patients do not actually meet the criteria for any of the 4 treatment subgroups based on the individual subgroup criteria and how often the additional criteria in the second part of the comprehensive algorithm are used to classify a patient. The extent of these issues would be best established by a prospective survey of consecutive cases of patients receiving physical therapy. Such a study would provide important information on the comprehensive algorithm and could assist in recommendations for its modification and improvement.
Therefore, this study had 3 purposes: (1) to determine the proportion of patients meeting the criteria for each treatment subgroup, more than one subgroup, or none of the subgroups and characterize the subgroups that overlap (if applicable) using the criteria from the individual studies that informed the classification algorithm (individual subgroup criteria) 10,12-14 ; (2) to determine the prevalence of subgroups based on the comprehensive classification algorithm (comprehensive algorithm); and (3) to assess the interrater reliability of the classification decision when using the comprehensive algorithm in therapists with limited experience using the algorithm.

Study Design
A cross-sectional, observational study was used to determine the prevalence of patients meeting the criteria for each treatment subgroup. A test-retest study design was used in a subset of patients to determine
interrater reliability of the classification decision when using the comprehensive algorithm.

Setting
Study recruitment occurred from June 2008 to June 2010 in Sydney, New South Wales, Australia, and Salt Lake City, Utah, and from June 2009 to June 2010 in Perth and Bunbury, Western Australia, Australia. Patients were recruited from teaching hospitals (Sydney, Australia) and from private physical therapy clinics (Australia and United States). The interrater reliability study was performed only at the Sydney, Australia, sites.

Participants
A total of 250 patients with acute or subacute LBP were recruited into the study. In order to be included in this study, patients had to be between the ages of 18 and 65 years, have LBP in the area bounded superiorly by T12 and inferiorly by the buttock crease (with or without leg pain) that had lasted for more than 24 hours but less than 90 days, and have a modified Oswestry Low Back Pain Disability Questionnaire (ODQ) score of ≥20%. Patients were excluded if their most prominent symptoms were something other than LBP or leg pain (eg, pain in lumbar, thoracic, and cervical spine, where cervical pain is the worst), if they were currently pregnant, or if they had lumbar surgery in the preceding 6 months, previous spinal fusion or scoliosis rods or screws, any known or suspected serious pathology (eg, fracture, tumor), spinal steroid injections within the last month, or any previous sclerotic injections, botulinum toxin injections (in the low back, abdominal, or pelvic area), or denervation procedures.

Therapists
Registered physical therapists (from Australia and the United States) were recruited to act as assessors for this study. Therapists were instructed to enroll consecutive patients with LBP. In Australia, all therapists underwent standardized training in assessing the criteria needed to classify patients into subgroups and in applying the comprehensive algorithm. The training session lasted approximately 45 minutes and involved practice and discussion of assessment techniques and algorithm use. Continued communication between researchers and therapists occurred throughout the recruitment period to address any questions or problems. In the United States, patient data were obtained from 3 physical therapists who had been involved in clinical research using the comprehensive algorithm and had incorporated it into clinical care for at least 5 years. An additional 7 physical therapists in the United States with less experience using the algorithm received additional training conducted with similar methods as in Australia. Data regarding each physical therapist's clinical experience (years at work), postgraduate training, and exposure to the algorithm and length of time using the algorithm (United States versus Australia) were recorded. Two of the authors (T.R.S., M.J.H.) also acted as assessors at the Sydney sites, particularly for the reliability assessments.

Examination Procedures and Classification Criteria
Prevalence study. Once consent was given to participate in the study, each eligible patient completed a numeric pain rating questionnaire to assess severity of back pain using an 11-point scale (0 -10) 21 and completed a pain drawing to show the location of the pain. 22 The ODQ was used to assess LBP-specific disability. 23 The Fear-Avoidance Beliefs Questionnaire (FABQ) was used to measure fear-avoidance beliefs about work and physical activity. 24 During the initial examination of the patient, a registered physical therapist collected data on all items required to classify patients using the comprehensive algorithm. Some items were assessed via questionnaires (as described above), whereas other items were assessed during the history taking or physical examination. The eAppendix (available at ptjournal.apta.org) lists the items required to classify a patient and describes the protocol for assessing each of the individual items. A standardized assessment form was used for all patients.
Following a brief history taking (eg, collecting information regarding age, sex, prior history of LBP, best and worst positions for LBP, and duration of LBP), therapists began by completing a neurological examination on each patient. Muscle strength (force-generating capacity), sensation, and reflexes (knee and ankle jerk) were tested, with each category coded as either normal or abnormal. 25 Straight-leg-raise range of motion then was assessed bilaterally using an inclinometer (also checking for a positive straight leg raise or crossed straight leg raise). 26 Patients were observed during spinal movement for the presence of aberrant movements such as an instability "catch," 27 painful arc, 28 reversal of lumbopelvic rhythm, or thigh climbing 29 (present or absent). Following this examination, an assessment of active spinal movements was performed. Single and repeated trunk flexion and trunk extension in a standing position, single and repeated trunk flexion in a sitting position, sustained extension in a prone position (30 seconds), and single and repeated trunk extension in a prone position were performed. Repeated movements were capped at 10 repetitions. 16 The effect of spinal movements on pain intensity (no change, increased, or decreased) and
pain location (no change, centralization, or peripheralization) was recorded.
While the patient was in a prone position, passive range of hip medial (internal) rotation was assessed using an inclinometer. 30 Spinal mobility then was assessed using a posterior-anterior pressure test on the L1-5 spinous processes, with spinal segments judged as being normal, hypermobile, or hypomobile. 31,32 Whether the mobility assessment caused concordant pain also was recorded. If pain was present with mobility testing, the prone instability test was performed and its outcome recorded (positive or negative). 25,33 Therapists were advised that they could perform any additional examination procedures on the patient that they felt necessary; however, they were instructed to perform these procedures following their assessment of the comprehensive algorithm criteria.
Interrater reliability study. Patients giving informed consent to participate in the reliability study were reassessed by a researcher (T.R.S., M.J.H.) who was blinded to the outcome of the initial assessment. The same testing procedure as in the initial assessment was used, and the reliability assessment was completed within 4 days of the initial assessment. Only patients who remained stable between examinations were included. Stable patients were defined as those having less than 6 points of change on the ODQ. Six points has been shown to be the minimal clinically important difference on the ODQ. 23,34

Outcome Measures
Classifying patients for the prevalence study. To determine the prevalence of patients meeting the criteria for each treatment subgroup, 2 methods of classification were used. First, patients were classified using the individual subgroup criteria that informed the algorithm (Tab. 1). 10,12-14 Each classification was considered separately; therefore, patients could fit one, more than one, or none of the classification categories. The criteria from the specific exercise subgroup RCT (eg, patient must demonstrate centralization with spinal movement testing) 10 were updated based on additional evidence of treatment response from a well-conducted RCT. 35 In this RCT, patients exhibiting a directional preference (decrease in pain intensity with spinal movement testing) who received treatment matched to their directional preference had better outcomes than those who received unmatched treatment. 35 Therefore, the criteria for specific exercise were updated to include patients who exhibited either centralization or a directional preference with spinal movement testing. Using the data on each criterion provided by the assessing therapist, the researchers rated each patient as either meeting or not meeting the criteria for each of the 4 treatment subgroups (manipulation, stabilization, specific exercise, and traction) for the individual studies.
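As an illustration of how the individual subgroup criteria can be applied independently of one another, the following minimal sketch encodes only the specific exercise and traction criteria as they appear in Table 1. The `Exam` record and its field names are hypothetical and are not part of the study's assessment form; the manipulation and stabilization criteria are omitted because they are not reproduced in this text.

```python
from dataclasses import dataclass


@dataclass
class Exam:
    """Hypothetical record of examination findings (field names are illustrative)."""
    centralization: bool = False
    directional_preference: bool = False
    positive_slr: bool = False
    reflex_deficit: bool = False
    sensory_deficit: bool = False
    strength_deficit: bool = False
    distal_symptoms_24h: bool = False  # pain or numbness distal to the buttock in previous 24 hours
    peripheralizes_with_extension: bool = False
    positive_crossed_slr: bool = False


def meets_specific_exercise(e: Exam) -> bool:
    # Updated criterion: centralization OR a directional preference.
    return e.centralization or e.directional_preference


def meets_traction(e: Exam) -> bool:
    # All three conditions must hold.
    nerve_root_signs = (
        e.positive_slr or e.reflex_deficit or e.sensory_deficit or e.strength_deficit
    )
    third_condition = e.peripheralizes_with_extension or e.positive_crossed_slr
    return nerve_root_signs and e.distal_symptoms_24h and third_condition
```

Because each function is evaluated separately, a single patient can satisfy one, several, or none of the subgroup checks, which is what permits the overlap analysis.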
Table 1 (excerpt).
Specific exercise: Demonstrated centralization or a directional preference (an improvement in pain intensity) during repeated movement testing in any one position (standing, sitting, or lying).
Traction 13 (must meet all criteria): signs and symptoms of nerve root compression (positive straight leg raise or reflex, sensory, or muscle strength deficit); and pain or numbness extending distal to the buttock in the previous 24 hours; and peripheralization of pain with extension or positive crossed straight leg raise.
Second, patients were classified using the comprehensive algorithm. The algorithm is structured in 2 parts (Figure). The first part is a set of hierarchical boxes that guide the therapist to give precedence to certain subgrouping findings over other findings. If a patient meets the criteria within one of these boxes, the classification category is determined for that patient and the classification process is complete. If the patient does not meet the subgroup criteria in any of the hierarchical boxes in the first part of the algorithm, the therapist is directed to the second part of the algorithm that provides additional classification criteria in the bottom box. At this point, the therapist must select the one classification that best fits the patient based on the criteria provided. Using this 2-part process, the comprehensive algorithm ensures that each patient will be placed into one, and only one, classification category. The treatment subgroup classification made by the assessing therapist was used for the comprehensive algorithm prevalence rates.
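The 2-part structure described above can be sketched as an ordered rule list with a fallback. This is only a structural sketch: the rule order, the predicate keys (`centralization`, `manipulation_box`), and the `bottom_box_choice` field standing in for the therapist's best-fit judgment are all hypothetical placeholders, not the contents of the Figure.

```python
from typing import Callable, Dict, List, Tuple

Findings = Dict[str, object]
Rule = Tuple[str, Callable[[Findings], bool]]


def classify(findings: Findings,
             hierarchy: List[Rule],
             best_fit: Callable[[Findings], str]) -> Tuple[str, bool]:
    """Return (subgroup, is_clear_classification).

    Part 1: the first hierarchical box whose criteria are met decides the subgroup.
    Part 2: if no box matches, a best-fit choice is made (an "unclear" classification).
    """
    for subgroup, criteria_met in hierarchy:
        if criteria_met(findings):
            return subgroup, True
    return best_fit(findings), False


# Hypothetical stand-ins for the hierarchical boxes and the bottom-box decision.
hierarchy: List[Rule] = [
    ("specific exercise", lambda f: bool(f.get("centralization", False))),
    ("manipulation", lambda f: bool(f.get("manipulation_box", False))),
]
best_fit = lambda f: str(f.get("bottom_box_choice", "manipulation"))

print(classify({"centralization": True, "manipulation_box": True}, hierarchy, best_fit))
print(classify({"bottom_box_choice": "stabilization"}, hierarchy, best_fit))
```

Returning the "clear" flag alongside the subgroup makes it straightforward to count clear versus unclear classifications later.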
Interrater reliability of the comprehensive algorithm classification decision. To determine the interrater reliability of the classification decision, the treatment subgroup chosen by each therapist using the comprehensive algorithm was compared. In order to evaluate potential sources of error in the comprehensive algorithm decision, the reliability of the specific classification criteria used in the algorithm was also assessed (eg, criteria in the hierarchical boxes and in the bottom table).

Sample size estimation.
We designed the study to include 250 participants, as this sample size would provide sufficiently precise estimates of subgroup prevalence across the likely ranges of prevalence. For example, if the prevalence of a subgroup was 5.2%, the 95% confidence interval (CI) would span 2.5% to 9.0%; for a prevalence of 50%, the 95% CI would span 44% to 56%.

Data Analysis
Prevalence. Prevalence rates were calculated in 2 ways. First, using the individual subgroup criteria, the prevalence of patients meeting the criteria for each treatment subgroup was calculated as: number of patients meeting subgroup criteria/N, where N is the total number of patients (N=250). These prevalence rates include patients who meet the criteria for more than one subgroup or no subgroups; therefore, they may not sum to 100%. These prevalence rates represent the proportion of patients seeking care who would be expected to meet the criteria for each treatment subgroup based on the individual subgroup criteria evaluated in isolation from the other subgroups. In addition, the prevalence of patients meeting the criteria for no subgroup, only 1 subgroup, only 2 subgroups, only 3 subgroups, or only 4 subgroups was calculated by dividing the number of patients by N (calculated generally and for each subgroup or subgroup combination). These prevalence rate calculations allow a patient to meet the criteria for only one of these groups (eg, patients who meet the criteria for the manipulation + stabilization subgroup combination are only counted under patients meeting only 2 subgroups); therefore, the prevalence rates sum to 100%. These prevalence rates represent the specific breakdown of which treatment subgroups tend to overlap and how many patients do not meet the individual subgroup criteria for any treatment subgroups.
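The two kinds of prevalence rate described above can be illustrated with a small sketch. Each patient is represented here as the set of subgroups whose individual criteria were met; this representation, the function names, and the toy data are assumptions made for illustration and are not study data.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set

SUBGROUPS = ("manipulation", "stabilization", "specific exercise", "traction")


def subgroup_prevalence(patients: List[Set[str]]) -> Dict[str, float]:
    """Per-subgroup prevalence; patients may count toward several subgroups,
    so these rates need not sum to 1."""
    n = len(patients)
    return {s: sum(s in met for met in patients) / n for s in SUBGROUPS}


def overlap_prevalence(patients: List[Set[str]]) -> Dict[int, float]:
    """Prevalence of meeting exactly 0, 1, 2, 3, or 4 subgroups; each patient
    is counted once, so these rates sum to 1."""
    n = len(patients)
    counts = Counter(len(met) for met in patients)
    return {k: counts.get(k, 0) / n for k in range(len(SUBGROUPS) + 1)}


def combination_prevalence(patients: List[Set[str]]) -> Dict[FrozenSet[str], float]:
    """Prevalence of each exact combination of subgroups (also sums to 1)."""
    n = len(patients)
    counts = Counter(frozenset(met) for met in patients)
    return {combo: c / n for combo, c in counts.items()}


# Toy data (not from the study): 4 patients.
patients = [
    set(),
    {"manipulation"},
    {"manipulation", "specific exercise"},
    {"manipulation", "specific exercise"},
]
print(subgroup_prevalence(patients))   # manipulation 0.75, specific exercise 0.5
print(overlap_prevalence(patients))    # 0: 0.25, 1: 0.25, 2: 0.5, 3: 0.0, 4: 0.0
```

In the toy data the per-subgroup rates sum to 1.25, whereas the exact-count rates sum to 1, mirroring the distinction drawn in the text.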
Second, using the comprehensive algorithm, the prevalence rate for each classification category was calculated as the number of patients placed into that category using the algorithm/N. Because the comprehensive algorithm requires each
patient to be placed into only one category, the prevalence rates for the 4 classification categories sum to 100%. In addition, the prevalence of patients with a "clear classification" (ie, the number of patients meeting a classification in the first part of the algorithm) was calculated as the number of patients classified in the hierarchical boxes/N. The prevalence of patients with an "unclear classification" was calculated as the number of patients classified using the second part of the comprehensive algorithm (ie, the criteria in the bottom box)/N.
Interrater reliability. Percentage of agreement and kappa coefficients with 95% CIs were calculated for the reliability of the classification decision between raters (when using the comprehensive algorithm) and for the reliability of the specific criteria comprising the comprehensive algorithm. If appropriate, a prevalence and bias adjusted kappa (PABAK) 36 value also was given. When missing data values were present for a variable for one rater only, this patient was removed from the reliability analysis, as a kappa coefficient could not be calculated. A sensitivity analysis was performed to determine whether any differences in reliability occurred between patients with "clear classification" (ie, based on the hierarchical boxes) and those with "unclear classification" (ie, based on using the second part of the comprehensive algorithm, the criteria in the bottom box).
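For readers less familiar with these statistics, the sketch below computes percentage of agreement, Cohen's kappa, and PABAK for a binary criterion rated by 2 raters. The ratings are hypothetical (not study data) and were chosen to show how a rarely present finding can yield a negative kappa despite high raw agreement; the PABAK formula used, 2 × observed agreement − 1, applies to 2-category ratings.

```python
from typing import List, Sequence


def percent_agreement(a: Sequence, b: Sequence) -> float:
    """Proportion of patients on whom the two raters agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence, b: Sequence) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e); NaN when chance agreement is 1."""
    n = len(a)
    p_o = percent_agreement(a, b)
    categories = set(a) | set(b)
    p_e = sum((list(a).count(c) / n) * (list(b).count(c) / n) for c in categories)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when both raters use one category only
    return (p_o - p_e) / (1.0 - p_e)


def pabak_binary(a: Sequence, b: Sequence) -> float:
    """Prevalence- and bias-adjusted kappa for 2-category ratings."""
    return 2.0 * percent_agreement(a, b) - 1.0


# Hypothetical ratings for 31 patients (1 = finding present, 0 = absent):
# each rater reports the finding in 3 patients, but never in the same patient.
rater_a: List[int] = [1] * 3 + [0] * 28
rater_b: List[int] = [0] * 3 + [1] * 3 + [0] * 25

print(round(percent_agreement(rater_a, rater_b), 3))  # 0.806
print(round(cohen_kappa(rater_a, rater_b), 3))        # -0.107
print(round(pabak_binary(rater_a, rater_b), 2))       # 0.61
```

Reporting both values side by side makes the effect of low prevalence visible rather than hiding it behind a single coefficient.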

Prevalence Study
A total of 545 patients were screened for inclusion to this study, with 295 excluded, reaching the goal of 250 patients. Reasons for exclusion were: LBP for more than 90 days (n=82), no current LBP (n=8), age greater than 65 years (n=62), age less than 18 years (n=6), ODQ score less than 20% (n=22), primary complaint not LBP (n=4), surgery within the previous 6 months (n=8), injections within the previous month (n=6), currently pregnant (n=2), did not provide consent (n=10), unable to complete assessment due to therapist or patient timing issues (n=78), and incomplete data (n=12). Of the 250 patients included, 125 were recruited from the United States and 125 were recruited from Australia (98 recruited from Sydney, Australia, and 27 recruited from Perth or Bunbury, Australia). Patient demographics are shown in Table 2. Of all patients recruited, 82.8% were recruited from private practice settings, 16% from hospital settings, and 1.2% from another source (eg, seeing a physical therapist but not recruited from that therapist's clinic). Information regarding the physical therapists involved in the study is presented in Table 3.
The prevalence rates for each treatment subgroup for classification using the individual subgroup criteria are provided in Table 4. Based on the individual subgroup criteria, the most common treatment subgroup for which the criteria were met was specific exercise (prevalence of 44.8%), followed by manipulation (prevalence of 35.2%). Of the whole sample, 49.6% of the patients met the criteria for only one subgroup (n=124), and of these patients, 21.6% met the criteria for only the specific exercise group. The majority of patients met the criteria for at least one treatment subgroup, with 25.2% of the patients meeting the criteria for more than one subgroup. The most common combination of subgroups for which the criteria were met was manipulation + specific exercise. Of the 22.8% of patients meeting the criteria for only 2 subgroups, 68.4% met the criteria for the manipulation + specific exercise subgroup combination. Only a small number of patients met the criteria for 3 treatment subgroups (manipulation, stabilization, and specific exercise), calculated to be 2.4% of the total sample. The prevalence of patients who did not meet the individual subgroup criteria for any of the treatment subgroups was 25.2%.
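The confidence intervals reported for these prevalence rates are numerically consistent with a normal-approximation (Wald) interval with N=250. The article does not state which interval method was used, so the following sketch is an assumption offered only as a way to reproduce the reported figures; the function name is illustrative.

```python
import math
from typing import Tuple


def wald_ci(count: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Normal-approximation (Wald) 95% CI for a proportion, clipped to [0, 1]."""
    p = count / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# 63 of 250 patients (25.2%) met no subgroup criteria; 124 of 250 (49.6%) met exactly one.
print([round(x, 3) for x in wald_ci(63, 250)])   # [0.198, 0.306]
print([round(x, 3) for x in wald_ci(124, 250)])  # [0.434, 0.558]
```

The Wald interval is symmetric around the estimate and can behave poorly for proportions near 0 or 1, so it is best read here as a check on the reported values rather than a recommendation.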
Using the comprehensive algorithm, manipulation was the most commonly met treatment subgroup, with 42.0% of the patients assigned by the algorithm to this subgroup (Tab. 5). The prevalence of patients who met the criteria for the specific exercise subgroup was 30.8%. Of the whole

Reliability Study
A total of 32 patients (of the 98 patients recruited from Sydney for the prevalence study) agreed to participate in the reliability study, with only 1 patient not remaining stable between the initial and second assessments. Of these patients, 77.4% were recruited from hospital settings, 12.9% from private practice settings, and 9.7% from another source (eg, seeing a physical therapist but not recruited from that therapist's clinic). Table 2 presents the patient demographics, and Table 3 presents the therapist demographics. The second assessment was completed within a time frame ranging from immediately following the first assessment to 4 days after the first assessment. The majority of assessments (90%) were completed directly following the first assessment.

Discussion
This study demonstrated that subgroups of patients with LBP identified by the individual subgroup criteria were mutually exclusive in approximately 50% of cases (ie, patients met the criteria for only one subgroup). In the remaining 50% of cases, patients either met the criteria for more than one subgroup (25.2%) or did not meet the criteria for any subgroups (25.2%). Important aspects of the content validity of classification systems are exhaustiveness and mutual exclusivity of their
classification categories. 38,39 Therefore, these results confirm the necessity of developing a comprehensive algorithm for translation of the classification criteria into clinical practice. When classified based on the comprehensive algorithm that was constructed to overcome the concerns about exhaustiveness and exclusivity of the classification categories identified in individual research studies, 34.0% of the patients had an unclear classification. An unclear classification appeared to adversely affect the interrater reliability of the comprehensive algorithm. These findings provide important information for evaluating the translation of the individual subgroup criteria into the comprehensive algorithm. They also provide essential information for guiding additional research and refinement of the comprehensive algorithm in terms of optimal hierarchical ordering of treatments, potential addition of further treatments to the algorithm, and modifications to the criteria used in the algorithm. The reliability of the classification decision when using the comprehensive algorithm was moderate, suggesting it is appropriate for use in a clinical setting.
The prevalence rates for each subgroup based on the individual subgroup criteria were generally similar to those in the literature, with the exception of the stabilization subgroup. Direct comparison of prevalence rates between the current study and previous work is made with caution due to differences in outcome definitions and patient populations. For the manipulation subgroup, previous studies' prevalence rates ranged from 23% to 59% 12,15,18,40 (present study: 35.2%, 95% CI=29.3% to 41.1%). For stabilization, prevalence rates from the literature ranged from 24% to 26% 18,40 (present study: 12.8%, 95% CI=8.7% to 17.0%). The reason for

Table 6 (excerpt). Specific exercise: "Centralize with 2 or more movements in the same direction?" 80.6; "Centralize with movement in one direction and peripheralize with movement in opposite direction?" −0

This study showed that approximately 25% of the patients with LBP met the criteria for more than one subgroup when using the individual subgroup criteria. This finding confirms the need for a hierarchical algorithm to define which subgroup will have priority when overlap occurs. Although some overlap was anticipated, the specific findings of this study provide important information for guiding future research into optimizing the comprehensive algorithm.
The most common subgroups that we found to overlap were the specific exercise and manipulation groups (using the individual subgroup criteria), accounting for 68.4% of patients who met 2 subgroups. Comparing the prevalence of classifications using the individual subgroup criteria with the classification made based on the comprehensive algorithm, it is clear that the algorithm tends to prioritize manipulation over the specific exercise classification. The reasons for this finding are apparent from examining the current algorithm. Although a clear classification of specific exercise is prioritized, as it is assessed prior to the manipulation criteria, the criteria for clear classification of specific exercise require centralization with movement and not the more prevalent finding of a directional preference. Use of a more conservative criterion of centralization likely moves many patients with a directional preference for movement, but not clear centralization, into the manipulation classification when applying the comprehensive algorithm for decision making. It is possible that some of these patients classified into the manipulation subgroup may achieve better outcomes with specific exercise. It also is possible that all patients with an overlap between specific exercise and manipulation should receive manipulation regardless of the presence of centralization. These are important areas for future research. It also is possible that patients meeting these 2 groups (or other combinations of groups) may do better if they receive both treatments initially rather than just the one treatment as per the decision based on the hierarchical comprehensive algorithm. This again is an important issue for future studies.
Another important finding of this study was that approximately 25% of the patients with LBP did not meet the criteria for any of the treatment subgroups when using the individual subgroup criteria. This is essential information for future research designed to refine the comprehensive algorithm and improve clinical outcomes. Currently, in these 25% of patients, use of the comprehensive algorithm would result in choosing a treatment subgroup that is unclear, as the algorithm forces all patients into a subgroup. Although the current study did not examine the clinical outcomes of the physical therapy care, it may be the case that patients with an unclear classification have inferior treatment outcomes. The implications of an unclear classification on clinical outcomes are an important topic for future research.
There are several possibilities for reducing the prevalence of unclear classifications using the comprehensive algorithm. First, the current comprehensive algorithm resulted in 34.0% of patients receiving an unclear classification using the additional criteria in the bottom
It is important to consider the clinical implications of these findings. The finding that many patients meet the criteria for more than one subgroup or do not clearly fit a subgroup (based on the individual subgroup criteria) suggests that when using the comprehensive algorithm, a therapist may begin with the treatment subgroup indicated by the algorithm and should monitor the patient's response to the treatment. Careful monitoring of a patient's response to treatment may be particularly important for those patients who do not clearly fit a subgroup and instead are classified using the bottom table of the algorithm.
Our study was the first to test the reliability of the comprehensive algorithm using 2 independent assessors. In a previous study, pairs of examiners scored previously obtained assessment findings from 60 patients with LBP using the earlier 3-treatment group classification algorithm. 16 This design examines the reliability of the structure of the algorithm, but does not include error that may be due to the testing procedures of the 2 raters and is likely to overestimate the true intertester reliability. In this previous study, 16 reliability was good (kappa=0.60, 95% CI=0.56 to 0.64, percentage of agreement=76%). We found comparable reliability with kappa values at moderate levels 37 (kappa=0.52, 95% CI=0.27 to 0.77, percentage of agreement=67%). In our study, when considering only patients with a clear classification, the reliability of the classification decision when using the comprehensive algorithm was good (kappa=0.69). When considering only patients with an unclear classification during the initial assessment, the reliability of the comprehensive algorithm decision was poor (kappa=0.23). These findings are important because, as mentioned above, the proportion of patients for whom the bottom table was used to determine their classification was 34.0%. Based on the reliability findings, it is unlikely that 2 therapists who are novices to the comprehensive algorithm would choose the same treatment for a patient when using the bottom table of the algorithm. This finding also indicates the need for further research to reduce the prevalence of unclear classifications. If the reliability of classification when the bottom table is used is better in therapists experienced in algorithm use, this would provide valuable information to support intensive training of therapists who are novices to algorithm use to reduce variability in clinical decisions.
Although the overall reliability of the classification decision when using the comprehensive algorithm was moderate, the reliability of the specific criteria used to classify patients was variable. Some criteria had very good reliability, such as "peripheralize with extension," which had a percentage of agreement of 96.8% (kappa=0.76). However, other criteria, such as detecting the presence of aberrant movements, demonstrated poor reliability (kappa=0.00). A previous study that investigated the reliability of detecting aberrant movements also demonstrated low kappa values (0.18; 95% CI=-0.07 to 0.43). 16 Interestingly, in criteria related to assessment of centralization, negative kappa values were found, indicating that the reliability of detecting centralization was less than chance. However, for these categories, percentage of agreement was good (approximately 80%). In this situation, it appears that a low base rate paradox 43,44 occurred. For example, the occurrence of patients who "centralize with 2 or more movements [positions] in the same direction" was rare, and when it was present, the raters never agreed on its presence. The kappa value for this criterion, adjusted for low prevalence, is 0.61, suggesting good reliability. Previous studies showed good to excellent reliability for the detection of centralization, with kappa values ranging from 0.50 to 1.00, 45 which is in line with the adjusted kappa value. There is controversy regarding whether this adjustment should be made; therefore, it is suggested that both adjusted and unadjusted values should be reported. 46
A limitation of this study is that a relatively small sample was used for the reliability study, which resulted in somewhat imprecise reliability estimates. Furthermore, the majority of the patients in the reliability study were recruited from a hospital setting. Comparison of patient demographics between the prevalence study and the reliability study suggests that the latter had a greater percentage of women and recruited patients with a longer duration of LBP, which may result in reliability findings that are different from what we would find in patients who are recruited from a private practice or with a shorter duration of symptoms. Additionally, a relatively large percentage of patients who were screened did not meet the eligibility criteria (54%). Therefore, it is important to apply the results of this study only to a similar population. The large number of patients excluded emphasizes that the comprehensive algorithm was designed to apply to non-elderly patients with acute or subacute LBP. Elderly patients with LBP or those with chronic symptoms comprise a significant proportion of individuals with LBP receiving physical therapy, and these patients likely require a different decision-making strategy.
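The low base rate paradox noted for the centralization criteria can be illustrated with a short calculation. The sketch below uses hypothetical counts chosen only to match the pattern described in the text (31 reassessed participants, roughly 80% agreement, and no participant for whom both raters recorded the finding); they are not the study's data. It also assumes that the prevalence adjustment takes the form of the prevalence-adjusted bias-adjusted kappa (PABAK = 2 x observed agreement - 1), which the text does not state explicitly.

```python
def observed_agreement(a, b, c, d):
    """Proportion of patients on whom the 2 raters agree.

    a = both raters positive, b = rater 1 positive only,
    c = rater 2 positive only, d = both raters negative.
    """
    return (a + d) / (a + b + c + d)


def cohen_kappa(a, b, c, d):
    """Unadjusted Cohen's kappa for a 2x2 interrater table."""
    n = a + b + c + d
    p_o = observed_agreement(a, b, c, d)
    # Chance agreement from each rater's marginal positive/negative rates.
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


def pabak(a, b, c, d):
    """Prevalence-adjusted bias-adjusted kappa (assumed adjustment)."""
    return 2 * observed_agreement(a, b, c, d) - 1


# Hypothetical counts (NOT the study's data): a rare finding that the
# raters never both record, with 25 of 31 patients rated negative by both.
a, b, c, d = 0, 3, 3, 25

print(round(observed_agreement(a, b, c, d), 2))  # 0.81
print(round(cohen_kappa(a, b, c, d), 2))         # -0.11
print(round(pabak(a, b, c, d), 2))               # 0.61
```

With these illustrative counts, the unadjusted kappa is slightly negative even though the raters agree on about 81% of patients, because nearly all of that agreement is expected by chance when the finding is rare. The adjusted value depends only on the observed agreement, which is why it can be much higher.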
Although limitations are present, this study has good external validity. Patients were recruited from numerous locations (Sydney, Bunbury, and Perth, Australia, and Salt Lake City, Utah) and from a variety of settings (hospital and private practice physical therapy), generating a broad sample. The generalizability of the results was further supported by a post hoc analysis of the country-specific prevalence rates (using the individual subgroup criteria) that demonstrated similar values between Australia and the United States (manipulation: 38.4% versus 32.0%; stabilization: 12.8% at both sites; specific exercise: 45.6% versus 44.0%; traction: 7.2% versus 12.0%). These prevalence rates are remarkably similar, given some of the differences in patient access to physical therapy care (eg, direct access in Australia, referral in the United States).

Conclusions
The results of this study suggest that approximately one half of patients with acute or subacute LBP seeking care meet the criteria for only one treatment subgroup using the individual subgroup criteria. Approximately 50% of patients either meet the criteria for more than one subgroup or none of the subgroups using the individual subgroup criteria. The decision-making strategy evaluated in this study (ie, the comprehensive algorithm) fit approximately one half of patients seeking physical therapy care for their LBP. This information is important for guiding future research that aims to refine or improve the comprehensive algorithm and physical therapy clinical decision making for patients with LBP. Further research is needed to explore the hierarchical ordering of the algorithm, the possibility of adding further physical therapy treatments, and the refinement of criteria used in the algorithm. The overall reliability of the comprehensive algorithm classification decision is adequate for clinical use, but the reliability of the lower table is poor and should be investigated further.