Context:

In 2005, the Endocrine Society (TES) adopted the Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) system for developing clinical practice guidelines. GRADE working group guidance suggests that strong recommendations based on low or very low (L/VL) confidence may often be inappropriate, and has offered a taxonomy of paradigmatic situations in which strong recommendations based on L/VL confidence estimates may be appropriate.

Objective:

We sought to characterize strong recommendations of TES based on L/VL confidence evidence.

Data Sources and Extraction:

We identified all strong recommendations based on L/VL confidence evidence published in TES guidelines between 2005 and 2011. We identified those consistent with one of the paradigmatic situations in the taxonomy.

Data Synthesis:

Of the 357 TES recommendations, 206 (58%) were strong; of these, 121 (59%) were based on L/VL confidence evidence. Of these 121, 35 (29%) were consistent with one of the paradigmatic situations. The most common situation (13, 11%) was a strong recommendation against the intervention because of low confidence evidence for benefit and high confidence evidence for harm. The remaining 86 (71%) comprised 43 (36%) “best practice” statements for which sensible alternatives do not exist; 5 (4%) in which recommendations were for “additional research”; 5 (4%) in which greater confidence in the estimates was warranted; and 33 (27%) for which we could not find a compelling explanation for the incongruence.

Conclusions:

Guideline panels should beware of formulating strong recommendations when confidence in estimates is low. Our taxonomy of when such recommendations are appropriate may be helpful.

The Institute of Medicine defines clinical practice guidelines as “statements that include recommendations intended to optimize patient care that, ideally, are informed by a systematic review of evidence and an assessment of the benefits and harms of alternative care options” (1). Practice guidelines should present estimates of the benefits and harms of the relevant options and a rating of the confidence the guideline panel had in those estimates. Guideline panels then formulate recommendations graded according to the likelihood that following them, rather than following an alternative course of action, will result in more good than harm. Properly formulated recommendations with transparent rating of evidence and grading of recommendations should help clinicians work with patients to ensure they receive optimal care consistent with their values and preferences.

The Endocrine Society (TES) is a professional organization that develops clinical practice guidelines for the care of patients with endocrine disorders. In 2005, TES adopted the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) framework to develop evidence-based recommendations (a list of over 70 organizations using GRADE can be found at www.gradeworkinggroup.org).

The GRADE system grades recommendations as either strong (“just do it” recommendations) or weak (conditional, valid for most but not all patients) (2). The strength of recommendation reflects the confidence the panel has that patients will be better off if they were to receive care consistent with the recommendation. Guideline developers offer strong recommendations (labeled in guidelines as recommendations) when they are confident that the desirable consequences of following the recommendation substantially outweigh the undesirable consequences. They offer a weak recommendation (labeled in guidelines as suggestions) when the tradeoffs are unclear (eg, low confidence in effect estimates) or when the tradeoffs are closely matched. GRADE rates the confidence in estimates (quality of evidence) into 1 of the 4 following categories: high, moderate, low, or very low. This rating reflects the confidence the panel has in the estimates of risks and benefits drawn from the body of evidence. Randomized trials start as providing high confidence in estimates, but may be rated down because of risk of bias, imprecision, indirectness, and inconsistency of results (3). Observational studies start as providing low confidence in estimates, but can be rated down to very low, or rated up (usually because of large or very large effects or a strong dose-response relationship).
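The starting levels and the rating up or down described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rate_confidence` and its parameters are hypothetical, and counting each rating decision as a whole step on the 4-level scale is a simplification, because in practice panels reach these ratings by judgment rather than by arithmetic.

```python
# Illustrative sketch only: GRADE ratings are panel judgments, not a formula.
# The names below (LEVELS, rate_confidence) are hypothetical.

LEVELS = ["very low", "low", "moderate", "high"]

def rate_confidence(design, levels_down=0, levels_up=0):
    """Return the confidence category for a body of evidence.

    design      -- "rct" (starts at high) or "observational" (starts at low)
    levels_down -- total levels rated down (eg, for risk of bias, imprecision,
                   indirectness, inconsistency)
    levels_up   -- total levels rated up (eg, for a large effect or a
                   dose-response relationship)
    """
    if design == "rct":
        start = LEVELS.index("high")
    elif design == "observational":
        start = LEVELS.index("low")
    else:
        raise ValueError(f"unknown design: {design!r}")
    if levels_down < 0 or levels_up < 0:
        raise ValueError("levels_down and levels_up must be non-negative")

    # Move along the scale, then clamp to the 4 available categories.
    position = start - levels_down + levels_up
    position = max(0, min(position, len(LEVELS) - 1))
    return LEVELS[position]

if __name__ == "__main__":
    print(rate_confidence("rct"))                           # high
    print(rate_confidence("rct", levels_down=2))            # low
    print(rate_confidence("observational", levels_up=1))    # moderate
    print(rate_confidence("observational", levels_down=1))  # very low
```

In this sketch, a randomized trial rated down twice and an unmodified observational study both end at low confidence, which is the L/VL range examined in this survey.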

Panels facing evidence that yields estimates about which they have low to very low confidence (L/VL) often formulate weak (conditional) recommendations. They typically offer strong recommendations when they have moderate to high confidence in the estimates of effects resulting in a clear tradeoff of risks and benefits in favor of the recommended course of action. Weak recommendations in the face of evidence warranting high confidence in the estimates of effect are often appropriate and occur when desirable consequences are closely matched with undesirable ones. In these instances, GRADE is particularly helpful as a system that separates the confidence in estimates from the strength of recommendations.

Cases of apparent incongruence between the confidence in the estimates of effect and the strength of recommendation may reflect the work of panels using GRADE properly. In these cases, we expect panels to recognize the apparent incongruence and provide a detailed account of their considerations. The GRADE Working Group has formulated paradigmatic situations in which panels might offer strong recommendations when they have only L/VL confidence in the estimates of effect (4); they have suggested that such circumstances rarely occur (Table 1). One justifiable instance of this practice is, for example, when recommending against a particularly invasive, dangerous, or expensive practice associated with uncertain benefits. Strong recommendations formulated on the basis of L/VL evidence may, however, represent a misapplication of GRADE. Recommendations in which evidence-recommendation incongruities are apparent represent index recommendations through which it is possible to judge the rigor of the guideline formulation process.

Table 1.

Paradigmatic Situations in Which Panels May Reasonably Offer Strong Recommendations Based on Very Low or Low Confidence in Effect Estimates

| Condition | Examples | Frequency, n (%) |
| --- | --- | --- |
| When low quality of evidence suggests benefit in a life-threatening situation | 1) We recommend increasing the glucocorticoid (GC) dosage of congenital adrenal hyperplasia patients in situations such as febrile illness (>38.5°C), gastroenteritis with dehydration, surgery accompanied by general anesthesia, and major trauma (10) | 1 (1) |
| When low quality of evidence suggests benefit and high quality of evidence suggests harm or a very high cost | 1) We recommend against screening for androgen deficiency in the general population (11) | 13 (11) |
| | Low quality of evidence of benefit from early detection but high quality of evidence of possible harm and cost | |
| | 2) We recommend against a general policy of offering testosterone therapy to all older men with low testosterone levels (11) | |
| | Low quality of evidence that treatment may benefit older men but high quality of evidence of possible harm | |
| When low quality of evidence suggests equivalence of 2 alternatives, but high quality evidence of less harm for one of the competing alternatives | 1) We recommend that treatment by unilateral laparoscopic adrenalectomy be offered to patients with documented unilateral primary aldosteronism (12) | 7 (6) |
| | Low quality of evidence suggests that laparoscopic adrenalectomy results in similar rates of cure in comparison to open adrenalectomy but high quality evidence suggests less harm/morbidity | |
| | 2) We recommend against the use of oral hydrocortisone suspension and against the chronic use of long-acting potent GCs in growing patients with classic congenital adrenal hyperplasia (10) | |
| | Low quality of evidence suggests that hydrocortisone tablets (weak steroid) are as effective as potent steroids to prevent adrenal crisis and virilization but high quality of evidence suggests less harm with hydrocortisone | |
| When high quality of evidence suggests equivalence of 2 alternatives and low quality of evidence suggests harm in one alternative | 1) Because available evidence suggests methimazole may be associated with congenital anomalies, propylthiouracil should be used as a first-line drug, if available, especially during first-trimester organogenesis (13) | 5 (14) |
| | High quality of evidence denotes equality in treatment of hyperthyroidism but low quality of evidence suggests harm in methimazole | |
| | 2) We recommend against the use of GCs that traverse the placenta, such as dexamethasone, for treatment of pregnant patients with congenital adrenal hyperplasia (10) | |
| | High quality of evidence denotes equality in treatment with dexamethasone (strong steroid) and hydrocortisone but low quality of evidence suggests harm in dexamethasone | |
| When low to high quality of evidence suggests benefits in one important outcome (outcome A) and low/very low quality of evidence suggests possibility of harm in critical (outcome B) and the harm regarding outcome B is valued much more highly than any benefit vis-a-vis outcome A | 1) We recommend against testosterone therapy in patients with prostate cancer (11) | 9 (7) |
| | High quality of evidence for moderate benefits of testosterone treatment in men with symptomatic androgen deficiency to improve bone mineral density and muscle strength. Low quality of evidence for harm in patients with or at risk of prostate cancer | |
| | 2) We recommend that treatment with growth hormone is contraindicated in the presence of an active malignancy (14) | |
| | Low quality of evidence for the benefit of patients with growth hormone deficiency, but low quality of evidence of critical harm (tumor growth) | |
| Total number of appropriate strong recommendations based on very low or low confidence in effects estimates | | 35 (29) |
| Best practice | For patients with congenital adrenal hyperplasia, we recommend monitoring patients for signs of glucocorticoid excess, as well as for signs of inadequate androgen suppression (10) | 43 (36) |
| | Statements should not have been graded, as sensible alternatives do not exist | |
| Mistaken judgment | We recommend intensive lifestyle modification to the entire family and to the patient, and as the prerequisite for all overweight and obesity treatments for children and adolescents (15) | 5 (4) |
| | Graded as low quality when there is moderate quality evidence for benefits | |
| Additional research | We recommend additional investigation using rodents and primates to further define the specific targets of androgen action (16) | 5 (4) |
| | Statements should not have been graded, as sensible alternatives do not exist | |
| Lack of compelling explanation | If a patient is unable or unwilling to undergo surgery, we recommend medical treatment with mineralocorticoids | 33 (27) |
| | Lack of evidence of mineralocorticoids being superior to other medical treatment (eg, anti-hypertensive medications) (12) | |
| Total number of strong recommendations based on very low or low confidence in effects estimates not consistent with paradigmatic situations | | 86 (71) |
| Total | | 121 (100) |

The standards of explicitness and rigor that TES adopted have created an opportunity for the evaluation of its guideline process. We conducted this systematic survey to characterize the patterns of decisions and judgments underlying TES guidelines formulated using the GRADE approach. In particular, we were interested in understanding strong recommendations based on L/VL evidence.

Materials and Methods

Data source

Because TES started using GRADE to support the formulation of its practice guidelines in 2005, we collected all TES practice guidelines issued from 2005 to 2011 and published on the website of TES as of December 21, 2011 (http://www.endo-society.org/guidelines/Current-Clinical-Practice-Guidelines.cfm).

Data extraction

Working independently and in duplicate, reviewers extracted all numbered recommendations included in TES practice guidelines. For each recommendation, reviewers noted the reported confidence in estimates (very low, low, moderate, high), strength of recommendation (weak, strong), and whether the recommendation was for or against an action.

We noted the design of the studies providing evidence supporting each recommendation as observational studies only (O), randomized controlled trials only (RCT), both (ORCT), or no study cited (NC). When systematic reviews were cited, we noted whether these were summarizing O, RCTs, or ORCTs.

We extracted the strong recommendations based on L/VL quality of evidence that were consistent with any of the 5 paradigmatic situations (Table 1). When such a recommendation did not fit the taxonomy, we determined the reason: “best practice” statements (which should not have been graded because sensible alternatives do not exist, eg, “For patients with Congenital Adrenal Hyperplasia, we recommend monitoring patients for signs of glucocorticoid excess, as well as for signs of inadequate androgen suppression”); priorities for “additional research” (also not subject to grading); or recommendations for which the confidence in estimates was erroneously underestimated (in which case the recommendations are truly strong recommendations based on high or moderate quality evidence). Our reviewers, working in duplicate, achieved substantial chance-adjusted agreement (κ = 0.7) in classifying recommendations; disagreements were resolved by discussion. In addition, working independently and in duplicate, we looked for descriptions of any other important factors that may have played a role in the decision-making process, such as cost or resource use and evidence of the likely distribution of patient values and preferences.
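For readers unfamiliar with chance-adjusted agreement, the sketch below shows one common way such a statistic is computed. It assumes the unweighted Cohen's kappa for 2 reviewers, which the text does not specify, and the reviewer labels in the usage example are invented for illustration; they are not study data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labeling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected is
    the agreement expected by chance given each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must label the same number of items")
    n = len(rater_a)
    if n == 0:
        raise ValueError("at least one item is required")

    # Proportion of items on which the two raters agree.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal proportions per label.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)

if __name__ == "__main__":
    # Hypothetical classifications of 4 recommendations by 2 reviewers.
    reviewer_1 = ["paradigmatic", "paradigmatic", "best practice", "best practice"]
    reviewer_2 = ["paradigmatic", "paradigmatic", "best practice", "paradigmatic"]
    print(cohens_kappa(reviewer_1, reviewer_2))  # 0.5
```

In the invented example, the reviewers agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.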

Results

TES recommendations and the evidence supporting them

From 2005 to December 2011, TES issued 357 recommendations presented in 17 published guideline documents.

Most of the recommendations were strong (n = 206, 58%). Evidence yielding low- to very-low-confidence estimates supported most recommendations (n = 256, 72%), most strong recommendations (n = 121, 59%), and almost all weak recommendations (n = 135, 92%). Systematic reviews supported 46 recommendations. Observational studies supported two-thirds of the recommendations (n = 233, 65%; Table 2).

Table 2.

Strength of Recommendations and Quality of Evidence

| | Strong: VL&L (%) | Strong: M&H (%) | Strong: Total (%) | Weak: VL&L (%) | Weak: M&H (%) | Weak: Total (%) | Total (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Evidence | | | | | | | |
| RCT/SRR | 10/7 (8) | 11/2 (13) | 21/9 (10) | 18/11 (12) | 4/1 (25) | 22/12 (15) | 43/21 (12) |
| ORCT/SROR | 7/1 (5) | 14/5 (16) | 21/6 (10) | 8/2 (7) | 4/1 (25) | 12/3 (8) | 33/9 (9) |
| O/SRO | 93/6 (77) | 57/2 (67) | 150/8 (72) | 75/8 (55) | 8/0 (50) | 83/8 (55) | 233/16 (65) |
| NC | 11 (9) | 3 (4) | 14 (7) | 34 (25) | | 34 (22) | 48 (14) |
| Direction | | | | | | | |
| For | 98 (81) | 80 (94) | 178 (87) | 126 (93) | 16 (100) | 142 (94) | 320 (90) |
| Against | 23 (19) | 5 (6) | 28 (13) | 9 (7) | | 9 (6) | 37 (10) |
| Total | 121 (59) | 85 (41) | 206 (58) | 135 (92) | 16 (8) | 151 (42) | 357 (100) |

Abbreviations: Met/Dib, diabetes mellitus-metabolic syndrome; M&H, moderate and high; NC, no study cited; PGA, pituitary-gonad-adrenal; SR, systematic review; SRO, systematic review including O; SROR, systematic review including RCT and O; SRR, systematic review including RCT.

Strong recommendations based on low- or very-low-confidence evidence (index recommendations)

We found 121 strong recommendations based on L/VL evidence, of which 98 (81%) recommended for and 23 (19%) recommended against a particular management strategy. Panelists judged 35 (29%) to be consistent with 1 of the 5 paradigmatic situations in which it is appropriate to offer a strong recommendation despite evidence warranting low to very low confidence in the estimates of effect (Table 1). Of the remaining 86 (71%) recommendations, panelists mistakenly rated down the quality of evidence in 5 (4%) recommendations (ie, the correct rating of confidence was moderate or high) and issued “additional research” and “best practice” statements in 5 (4%) and 43 (36%), respectively. We could not find a compelling explanation for the incongruence in 33 (27%) of the 121 strong recommendations based on L/VL evidence (Supplemental Tables 1–4, published on The Endocrine Society's Journals Online web site at http://jcem.endojournals.org).
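As a quick arithmetic check on the breakdown above, the reported counts can be tallied in a few lines of Python. This is an illustrative sketch only, not part of the original analysis: the category labels and the `percent` helper are names chosen here, and percentages are assumed to be rounded to the nearest whole number.

```python
# Counts reported in the text for the 121 index recommendations
# (strong recommendations based on L/VL confidence evidence).
INDEX_TOTAL = 121

# Labels are paraphrases chosen for this sketch, not the authors' terms.
categories = {
    "consistent with a paradigmatic situation": 35,
    "confidence mistakenly rated down": 5,
    "additional research statement": 5,
    "best practice statement": 43,
    "no compelling explanation": 33,
}


def percent(count, total=INDEX_TOTAL):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


# The five categories should account for every index recommendation.
assert sum(categories.values()) == INDEX_TOTAL

for label, count in categories.items():
    print(f"{label}: {count} ({percent(count)}%)")
```

Under that rounding assumption, the five categories sum to 121 and the computed shares match the percentages reported in the paragraph.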

Cost or resource used was mentioned as an important factor in 25 of the 121 recommendations (21%), but we could not identify with confidence the role cost played in determining the direction or strength of the recommendation.

Discussion

Main findings

The majority (58%) of TES recommendations are strong (ie, “just do it” recommendations). Their strength implies that guideline panelists were very confident that all patients would be better off if they were to receive guideline-concordant care. Strong recommendations also imply that, if fully informed, these patients would choose the recommended course of action. Such recommendations are often candidates for quality-of-care criteria.

Of the 206 strong recommendations, 121 were based on evidence warranting low to very low confidence in estimates of risk or benefit. About a third (29%) of these apparent incongruities were consistent with one of the paradigmatic situations, the most common reason being that the panelists had low confidence in the estimates of benefit, but high confidence in the estimates of harm or resource use, thus warranting strong recommendations against the intervention. Additionally, we found that panelists often inappropriately graded “best practice” and “additional research” statements. Finally, a compelling explanation was lacking for 33 (27%) of the incongruous recommendations, suggesting that these recommendations should probably have been formulated as weak or conditional.

Limitations and strengths of our analysis

Because our study relied on the guideline document, to the extent that panelist considerations were left implicit, our judgments, while reproducible, may not reflect the considerations panelists actually took into account when they formulated their recommendation. We might have mistaken a recommendation as not consistent with the taxonomy because the alternative was not clearly stated. Also infrequently stated were considerations related to resource use and about panelists' possible interest in promoting reimbursement for emerging technologies or services. Finally, we found limited reference to patient preferences in justifying recommendations. New methodological research may uncover paradigmatic situations beyond those we have identified (Table 1) in which strong recommendations are justified despite panelists having low confidence in the estimates of effect.

We focused this analysis on TES guidelines, rather than on all endocrine guidelines, because TES guidelines exclusively use the GRADE approach, a system of guideline formulation with explicit rules that separate judgments about confidence in the estimates from the strength of recommendation. Our familiarity with GRADE and our systematic and reproducible process for data extraction and classification of recommendations further strengthen our conclusions.

How our findings compare with other guideline assessments

Limited research has assessed the quality of evidence and level of recommendations of clinical practice guidelines. Hazlehurst et al (5) examined 5 endocrine guidelines and found that many recommendations were supported by L/VL evidence, but did not report the frequency of strong recommendations based on this evidence. Tricoci et al (6) studied the American College of Cardiology/American Heart Association guidelines and found that 48% were based on L/VL evidence and that this proportion is increasing over time. These authors did not examine the frequency of strong recommendations based on L/VL evidence. Two studies focused on guidelines using GRADE. Djulbegovic et al (7) found that 11 (64%) of the strong recommendations included in the clinical guidelines for the use of fresh-frozen plasma of the American Association of Blood Banks were based on L/VL evidence. In a study of the guidelines of the Society for Vascular Surgery, 65% of recommendations were strong and supported by L/VL evidence (8). None of these studies addressed the appropriateness of strong recommendations based on low or very low quality evidence.

Implications for guideline development and research

It is important to understand why there are so many strong recommendations in a field plagued by evidence yielding limited confidence in the estimates of effect about the available options. The taxonomy of paradigmatic situations in which these judgments are appropriate supports many of the strong recommendations of TES and might help future panels struggling with making decisions regarding strong or weak recommendations in the face of L/VL evidence. Our results highlight the need for caution in making strong recommendations in the face of limited confidence in estimates: many of the recommendations appear to be inappropriately strong, suggesting the need for improvement in the training of guideline panelists or in the guidance they receive from methodological experts.

Other explanations exist for incongruent recommendations. A recent study (9) that chronicled the experience of using GRADE in developing national guidelines described a voting system used to decide the strength of recommendations; voting, in almost all cases, favored strong recommendations irrespective of the confidence in estimates. The author speculated that guideline developers might perceive weak recommendations as not helpful. Alternatively, panels may have ignored the quality of the evidence and preferred to make recommendations in accord with their practice style. Panelists may also consider formulating strong recommendations to ensure payers will cover a service supported by L/VL evidence.

Studying incongruent recommendations has uncovered many instances of appropriately strong recommendations in the face of L/VL evidence consistent with the GRADE taxonomy. Clinicians, however, may have difficulty distinguishing between these appropriately strong recommendations and those in which a weak or conditional recommendation would have been more congruent with the quality of evidence. When panels, as is the case with TES, frequently offer strong recommendations in the face of L/VL evidence, clinicians may be wise to exercise caution in following such recommendations. In turn, panelists should consider the possible deleterious impact of formulating such recommendations. First, some patients may receive treatments they would not have chosen. Second, such recommendations may engender mistrust in the guidelines. Third, issuing these recommendations may delay or prevent research, for example by hindering patient enrollment.

To avoid the inappropriate grading of “best practice” and “additional research” statements, we suggest that panelists word them differently from graded recommendations. For instance, a “best practice” statement may read, “Best practice is to conduct a physical examination on the patient.” Panelists can offer research direction by suggesting priorities (eg, “Identifying specific targets of androgen action in rodents is a research priority”). In addition, it is important to remember that in the context of guidelines the terms “recommend” and “suggest” should be reserved for strong and weak recommendations, respectively.

This analysis highlights how much comparative effectiveness research (asking important questions and measuring the effects of alternatives on outcomes that matter to patients) is necessary in endocrinology: more than half of all RCTs supporting a recommendation had problems of imprecision or indirectness or produced inconsistent results (Table 2). The best available evidence for the rest of the recommendations came from unsystematic clinical observations. Endocrinologists and their patients should take an active role in promoting and conducting comparative effectiveness research. Furthermore, they should lobby the Patient Centered Outcomes Research Institute and the Agency for Healthcare Research and Quality, participate in their work, and police it, ensuring that these agencies fund randomized trials measuring outcomes important to patients.

Conclusions

Guideline panels should beware of formulating strong recommendations when confidence in estimates is low. This was a common, but sometimes appropriate, practice in TES guidelines. Our taxonomy of paradigmatic situations in which such recommendations are appropriate may be helpful to guideline panelists and to clinicians, who should carefully consider the appropriateness of strong recommendations based on evidence yielding limited confidence in the estimates of effect.

Acknowledgments

J.P.B., J.P.D., M.H.M., and V.M.M. are part of the Knowledge and Evaluation Research Unit that contracts with nonprofit professional organizations and public agencies including The Endocrine Society to conduct systematic reviews and support the formulation of clinical practice guidelines. G.H.G., M.H.M., and V.M.M. are active members of the GRADE working group.

Disclosure Summary: The authors have no conflicts of interest to disclose.

For editorial see page 3174

Abbreviations

     
  • GRADE: Grading of Recommendations, Assessment, Development, and Evaluation
  • L/VL: low or very low
  • NC: no study cited
  • O: observational studies only
  • ORCT: both observational studies and randomized controlled trials
  • RCT: randomized controlled trials only
  • TES: The Endocrine Society

References

1. Graham R, Mancher M, Wolman DM, Greenfield S, Steinberg E; Committee on Standards for Developing Trustworthy Clinical Practice Guidelines, Institute of Medicine. Clinical Practice Guidelines We Can Trust. 1st ed. Washington, DC: National Academies Press; 2011.

2. Brozek JL, Akl EA, Alonso-Coello P; for the GRADE Working Group. Grading quality of evidence and strength of recommendations in clinical practice guidelines. Part 1 of 3. An overview of the GRADE approach and grading quality of evidence about interventions. Allergy. 2009;64:669–677.

3. Balshem H, Helfand M, Schünemann HJ, et al. GRADE guidelines: 3. Rating the quality of evidence. J Clin Epidemiol. 2011;64:401–406.

4. Andrews JC, Schünemann HJ, Oxman AD, et al. GRADE guidelines 15: going from evidence to recommendation-determinants of a recommendation's direction and strength. J Clin Epidemiol. 2013;66:726–735.

5. Hazlehurst J, Armstrong M, Sherlock M, et al. A comparative quality assessment of evidence-based clinical guidelines in endocrinology. Clin Endocrinol (Oxf). 2013;78:183–190.

6. Tricoci P, Allen JM, Kramer JM, Califf RM, Smith SC Jr. Scientific evidence underlying the ACC/AHA clinical practice guidelines. JAMA. 2009;301:831–841.

7. Djulbegovic B, Trikalinos TA, Roback J, Chen R, Guyatt G. Impact of quality of evidence on the strength of recommendations: an empirical study. BMC Health Serv Res. 2009;9:120.

8. Murad MH, Montori VM, Sidawy AN, et al. Guideline methodology of the Society for Vascular Surgery including the experience with the GRADE framework. J Vasc Surg. 2011;53:1375–1380.

9. Agweyu A, Opiyo N, English M. Experience developing national evidence-based clinical guidelines for childhood pneumonia in a low-income setting - making the GRADE? BMC Pediatr. 2012;12:1.

10. Speiser PW, Azziz R, Baskin LS, et al. Congenital adrenal hyperplasia due to steroid 21-hydroxylase deficiency: an Endocrine Society clinical practice guideline. J Clin Endocrinol Metab. 2010;95:4133–4160.

11. Bhasin S, Cunningham GR, Hayes FJ, et al. Testosterone therapy in men with androgen deficiency syndromes: an Endocrine Society clinical practice guideline. J Clin Endocrinol Metab. 2010;95:2536–2559.

12. Funder JW, Carey RM, Fardella C, et al. Case detection, diagnosis, and treatment of patients with primary aldosteronism: an Endocrine Society Clinical Practice Guideline. J Clin Endocrinol Metab. 2008;93:3266–3281.

13. Abalovich M, Amino N, Barbour LA, et al. Management of thyroid dysfunction during pregnancy and postpartum: an Endocrine Society Clinical Practice Guideline. J Clin Endocrinol Metab. 2007;92:S1–47.

14. Molitch ME, Clemmons DR, Malozowski S, and the Endocrine Society's Clinical Guidelines Subcommittee. Evaluation and treatment of adult growth hormone deficiency: an Endocrine Society Clinical Practice Guideline. J Clin Endocrinol Metab. 2006;91:1621–1634.

15. August GP, Caprio S, Fennoy I, et al. Prevention and treatment of pediatric obesity: an endocrine society clinical practice guideline based on expert opinion. J Clin Endocrinol Metab. 2008;93:4576–4599.

16. Wierman ME, Basson R, Davis SR, et al. Androgen therapy in women: an Endocrine Society Clinical Practice Guideline. J Clin Endocrinol Metab. 2006;91:3697–3710.

Supplementary data