Abstract

The level of evidence (LOE) grading system for ESC Clinical Practice Guidelines classifies the quality of the evidence supporting a recommendation. However, the current taxonomy does not fully consider the optimal study design necessary to establish evidence for such recommendations. Therefore, two separate taskforces of clinical and methodological experts were appointed by the Clinical Practice Guidelines Committee, with the first tasked with updating the LOE grading system for therapy and prevention, and the second responsible for developing a LOE grading system for diagnosis and prediction. The updated system for therapy and prevention presented here maintains the three-level grading structure but uses revised definitions. Level of evidence A represents conclusive evidence usually from ≥2 adequately powered randomized controlled trials (RCTs) free from major bias, with substantial evidence against the play of chance when combined in a meta-analysis (e.g. P < .005 for superiority). Additional criteria are specified to define substantial evidence against the play of chance in case of non-inferiority, equivalence, and harm. Level of evidence B is now subdivided into B1 and B2. Level of evidence B1 represents suggestive evidence usually from ≥1 adequately powered RCT free from major bias, or a meta-analysis of such RCTs, with some evidence against the play of chance (e.g. P < .05 for superiority). Level of evidence B2 represents limited evidence from ≥2 adequately powered non-randomized studies with careful control of major sources of bias or from a meta-analysis of small, underpowered RCTs. Level of evidence C represents preliminary evidence from either non-randomized studies without careful control of major sources of bias, a single small, underpowered RCT, or expert consensus.

The original and revised level of evidence grading systems. RCT, randomized controlled trial.
Graphical Abstract


Introduction

The European Society of Cardiology (ESC) is a professional medical society that has published Clinical Practice Guidelines on the prevention, diagnosis and treatment of cardiovascular diseases since 1994.1 These cover a wide range of topics, including the management of stable and acute coronary syndromes, heart failure, arrhythmias, hypertension, valvular heart disease, congenital heart disease, and cardiovascular prevention. Each guideline is developed by a taskforce composed of experts in the relevant area, including, as necessary, cardiologists, surgeons, allied healthcare professionals, patients, clinical trialists, epidemiologists, and methodologists. Each guideline undergoes a rigorous review process involving experts and representatives from the 57 National Cardiac Societies of the ESC. A strict declaration of interest policy is enforced for both taskforce members and reviewers. All recommendations must receive at least 75% of the taskforce members’ votes in order to be approved.

The ESC uses a three-tiered level of evidence (LOE) grading system, which classifies the quality of the evidence supporting each recommendation in a guideline. The current criteria for each LOE are the following:

  • LOE A (the highest level): data from multiple randomized controlled trials (RCTs) or meta-analyses

  • LOE B: data derived from a single RCT or large non-randomized studies

  • LOE C (the lowest level): consensus of opinion of the experts and/or small studies, retrospective studies, registries

The LOE system was first published in 2000 as part of the ESC Guidelines on the management of acute coronary syndromes without persistent ST segment elevation2 based on written recommendations of the ESC Committee for Scientific and Clinical Initiatives, which were made available on ESC's website in the same year.1 Since then, the LOE system has been widely adopted and used in nearly all subsequent ESC guidelines without major adaptations.3

The current LOE system has limitations. It is applied to recommendations covering diverse aspects of cardiology, including therapy, prevention, diagnosis and prediction. However, the optimal study designs necessary to establish evidence for therapy and prevention differ from those necessary to establish evidence for diagnosis or prediction, where RCTs are not necessarily required to meet the highest LOE.

For therapy and prevention, the current LOE A does not stipulate that meta-analyses must include only RCTs. Nor does it consider the quality of meta-analyses,4 the quality5,6 and statistical power7,8 of the RCTs that support a recommendation, the extent of heterogeneity between RCTs,9 or whether there is strong evidence against the play of chance.10,11 As a result, two small, underpowered RCTs or a statistically heterogeneous meta-analysis of small RCTs can achieve LOE A, while a single, very large high-quality RCT will necessarily be classified as LOE B.

With the current system, LOE B could be achieved by two large observational studies irrespective of their quality. Given the immediate electronic availability of healthcare data, observational studies are easier to conduct in the era of ‘big data’, but this does not necessarily make them more reliable than they were when data accessibility was more limited. Observational studies continue to face methodological challenges, including confounding and selection bias,12 and usually cannot be relied on to provide the correct answer, particularly when the true effects of an intervention are small to moderate.13,14 The major limitation of the current criteria for LOE B is that they treat evidence from a single RCT and evidence from large observational studies as being of equal value when assessing treatment effects, even though RCTs are designed to control for bias and confounding, while observational studies may yield misleading findings due to these factors, especially when data quality is low and/or methods are less than rigorous.

In view of the limitations of the current LOE grading system, the ESC CPG Committee appointed a methodology taskforce and asked it to revise the LOE grading system. The revision was done in two phases. The objective of the first phase was to revise the LOE concept to inform guideline recommendations for therapy and prevention, while the objective of the second phase was to develop an LOE grading system for diagnosis and prediction. During the first phase, the concept was developed collaboratively in eight collective meetings of the taskforce members over a span of 2 years. The revised LOE system was subjected to an impact assessment to evaluate its influence on LOE in previous guidelines. The document underwent a rigorous peer review process involving expert methodologists and the CPG Committee, and the final document was approved by the ESC Board.

Overview of the revised level of evidence grading system

The original and revised LOE grading systems are presented in Table 1. The revised system provides a standardized approach for evaluating the strength of evidence in ESC guidelines (Graphical Abstract).

Table 1

Original and revised levels of evidence grading systems


The table uses examples of superiority claims, for which the two-sided P-value for superiority should generally be <.005 for the outcome of interest, and the benefit should clearly exceed the known harms to reach level of evidence A. The criteria to establish substantial evidence against the play of chance for claims of non-inferiority, equivalence and harm are presented in Table 2. In the context of ESC guidelines, studies with 60 events or more, which have sufficient power to detect a large 50% relative reduction in the risk of the outcome of interest at a two-sided α of .05, are considered to be ‘adequately powered’. These studies are located in the upper region of funnel plots and are considerably less likely to be affected by small-study effects than smaller, inadequately powered studies. Therefore, they can serve as reference points to facilitate proper visual interpretation of funnel plots (see Supplementary data online, Web-Appendix Section S6). Further guidance on adequately powered studies is presented in the Explanatory Notes below and in Supplementary data online, Web-Appendix Section S1, and major sources of bias are summarized in Table 3 below.

(a) Exceptionally, level of evidence A can be considered if (1) there is a single very large RCT that provides convincing evidence across a wide range of patients that is already widely accepted by most practitioners; or (2) the effects of the intervention are clinically obvious; in either case, >90% of voting guideline taskforce members need to agree with level of evidence A.

(b) Studies do not need to individually show substantial evidence against the play of chance (e.g. a P-value of <.005 for superiority) if a meta-analysis of these studies provides such evidence.

(c) Studies do not need to individually show some evidence against the play of chance (e.g. a P-value of <.05 for superiority) if a meta-analysis of these studies provides such evidence.


Description of revised levels of evidence

Level of evidence A

Level of evidence A applies to conclusive evidence. For superiority claims, the benefits of an intervention should clearly exceed the known harms, a requirement that goes beyond simply achieving P < .005 for the outcome of interest. LOE A indicates that it is very unlikely that future research in the same patient population would refute the evidence.

In cardiovascular medicine, most interventions demonstrate a moderate benefit at best, with relative reductions of <25% for major clinical outcomes such as the composite of cardiovascular death, myocardial infarction, or stroke. However, in a few cases, interventions have been reliably shown to offer larger benefits, such as anticoagulation for secondary prevention of ischaemic strokes in patients with atrial fibrillation.15 To detect treatment effects that align with realistic expectations, studies must be randomized to ensure that the influence of any biases is substantially smaller than the treatment effect to be detected, and they must be adequately powered to ensure that random error is smaller than the expected treatment effect.

LOE A, therefore, usually requires at least two adequately powered RCTs, either analysed individually or combined in a meta-analysis. The totality of available adequately powered RCTs should provide substantial evidence against the play of chance (e.g. P < .005 for superiority) and should be of high quality and free of major sources of bias that could have meaningfully influenced the conclusions. The absolute benefits observed in these RCTs need to clearly outweigh the absolute risk of harm.

In the context of ESC guidelines, RCTs with 60 events or more that have sufficient power to detect a large 50% relative reduction in the risk of the outcome of interest are considered to be ‘adequately powered’. These RCTs are located in the upper region of funnel plots and are considerably less likely to be affected by small-study effects than smaller, inadequately powered studies. Therefore, they can serve as reference points to facilitate proper visual interpretation of funnel plots as explained in Supplementary data online, Web-Appendix Section S6. RCTs with <60 accrued events for the outcome of interest are almost certainly underpowered and more likely to be affected by small-study effects. Therefore, they should not form the basis for LOE A. The taskforce acknowledges that a considerably higher number of events is required to detect smaller relative risk reductions with adequate power. For example, around 900 accrued events are required for RCTs with binary outcomes of interest to have adequate power to detect a relative reduction of 15% if the risk of the outcome is around 30% in the control group. Further guidance on adequate power for RCTs is provided in the Explanatory Notes below and Supplementary data online, Web-Appendix Section S1.
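The event counts quoted above can be approximated with a textbook sample-size calculation. The following is a minimal sketch and not the taskforce's own method, which is described in Web-Appendix Section S1. It assumes a two-arm RCT with 1:1 allocation, a binary outcome, a normal approximation for the difference between two proportions, a two-sided α of .05, and 80% power; the power value in particular is an assumption that this section does not state. The function name `events_required` and its parameters are hypothetical.

```python
from statistics import NormalDist


def events_required(control_risk, rel_reduction, alpha=0.05, power=0.80):
    """Approximate total number of events needed in a 1:1 two-arm trial.

    Assumptions (not taken from the source): binary outcome, normal
    approximation for the difference of two proportions, 80% power.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = norm.inv_cdf(power)            # quantile for the desired power
    p_control = control_risk
    p_treated = control_risk * (1 - rel_reduction)
    variance_sum = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    n_per_arm = (z_alpha + z_beta) ** 2 * variance_sum / (p_control - p_treated) ** 2
    # Expected events across both arms at the assumed risks.
    return n_per_arm * (p_control + p_treated)


# Scenario from the text: 15% relative reduction, 30% risk in the control group.
print(round(events_required(0.30, 0.15)))
# A large 50% relative reduction at the same (assumed) control risk.
print(round(events_required(0.30, 0.50)))
```

Under these assumptions the first scenario gives a figure of the order of 850 to 900 events and the second a figure between 50 and 60, which is broadly in line with the thresholds quoted in the text. Other power levels, control risks, or time-to-event formulas will shift these numbers.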

The taskforce chose a cut-off of <.005 to define substantial evidence against the play of chance for two-sided P-values for superiority. Guidance on assessing evidence against the play of chance in case of superiority, non-inferiority, equivalence, or harm is provided in the Explanatory Notes below and summarized in Table 2. The rationale for the chosen cut-offs is given in Supplementary data online, Web-Appendix Section S2. The requirement of substantial evidence against the play of chance should usually be met by a meta-analysis of RCTs involving largely similar populations. If an intervention was found to be superior to a control intervention or placebo, for example, individual RCTs do not need to show a two-sided P < .005 for superiority to reach LOE A if a meta-analysis of these RCTs does. If a meta-analysis of adequately powered RCTs is unavailable in the published literature, it can be conducted by the methodologists of a guideline taskforce (e.g. a fixed-effects meta-analysis on a relative scale; for details, see Supplementary data online, Web-Appendix Section S3). In cases where a superiority analysis comparing an active intervention with placebo or usual care yields non-significant results despite adequate power, guideline taskforce members should inspect the 95% confidence interval to determine whether it satisfies criteria of non-inferiority or equivalence of placebo or usual care compared to the active intervention for LOE A.
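To illustrate how a meta-analysis can meet the P < .005 requirement when individual RCTs do not, the sketch below pools hazard ratios with an inverse-variance fixed-effect model on the log scale. It is one common way to carry out a fixed-effects meta-analysis on a relative scale and is not necessarily the procedure set out in Web-Appendix Section S3. The function name `fixed_effect_pool` is hypothetical, the two trials are invented for illustration, and standard errors are assumed to be recoverable from the width of each reported 95% confidence interval.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def fixed_effect_pool(trials, level=0.95):
    """Inverse-variance fixed-effect pooling of hazard ratios on the log scale.

    trials: list of (hr, ci_lower, ci_upper) tuples; the confidence limits are
    assumed to be reported at the same `level` used for the pooled interval.
    Returns (pooled_hr, ci_lower, ci_upper, two_sided_p).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - (1 - level) / 2)
    weights, estimates = [], []
    for hr, lower, upper in trials:
        se = (log(upper) - log(lower)) / (2 * z_crit)  # SE recovered from CI width
        weights.append(1 / se ** 2)
        estimates.append(log(hr))
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    p_value = 2 * (1 - norm.cdf(abs(pooled / pooled_se)))
    return (exp(pooled),
            exp(pooled - z_crit * pooled_se),
            exp(pooled + z_crit * pooled_se),
            p_value)


# Hypothetical trials (not real data); neither reaches P < .005 on its own.
hypothetical_trials = [(0.80, 0.66, 0.97), (0.85, 0.72, 1.00)]
pooled_hr, ci_low, ci_high, p = fixed_effect_pool(hypothetical_trials)
print(f"pooled HR {pooled_hr:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f}), P = {p:.4f}")
print("P < .005:", p < 0.005)
```

In this invented example the pooled two-sided P-value falls below .005 although neither trial does so individually. A fixed-effect model is only reasonable when the trials enrol largely similar populations and unexplained heterogeneity is low.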

Table 2

Definitions of substantial evidence against the play of chance by type of claim

Type of claim | Definition
Superiority | 2-sided P-value for superiority usually <.005.
Non-inferiority | Upper limit of 2-sided 95% confidence interval usually excludes 1.15 for unfavourable outcome on a relative scale (e.g. hazard ratio).
Equivalence | 2-sided 95% confidence interval usually excludes 0.87 and 1.15 on a relative scale (e.g. hazard ratio).
Harm | Point estimate indicates harm, lower limit of 2-sided 95% confidence interval excludes 1 for unfavourable outcome on a relative scale (e.g. hazard ratio), 2-sided P < .05.

In the context of ESC guidelines, the requirement of substantial evidence against the play of chance (e.g. a P-value of <.005 for superiority) should usually be met by a meta-analysis of adequately powered studies; individual studies do not need to show substantial evidence against the play of chance if a meta-analysis of these studies provides such evidence. The lower boundary of 0.87 to support equivalence corresponds to the inverse of 1.15 (1/1.15). Cardiovascular RCTs with continuous outcomes usually assess superiority or harm. In the uncommon scenario of establishing non-inferiority or equivalence for continuous outcomes, treatment effects and their 95% confidence intervals should be expressed in units of the pooled standard deviation of the outcome at follow-up. To establish substantial evidence against the play of chance when claiming non-inferiority or equivalence, the 95% confidence interval should exclude −0.2 and/or +0.2 standard deviation units.
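The thresholds in Table 2 can be expressed as simple checks on a pooled estimate. The sketch below is illustrative only: the function name `meets_table2_criterion` and the claim labels are hypothetical, the estimate is assumed to be a hazard ratio for an unfavourable outcome, and the direction check for superiority (point estimate below 1) is an added assumption that Table 2 does not spell out. Because the definitions in Table 2 are qualified with 'usually', such a check can inform but not replace the judgement of a guideline taskforce.

```python
NON_INFERIORITY_MARGIN = 1.15
EQUIVALENCE_LOWER = 1 / NON_INFERIORITY_MARGIN  # about 0.87, as noted in the text


def meets_table2_criterion(claim, hr, ci_lower, ci_upper, p_value=None):
    """Check a hazard ratio for an unfavourable outcome against Table 2.

    ci_lower and ci_upper are the limits of the 2-sided 95% confidence
    interval; p_value is the 2-sided P-value where the criterion needs one.
    """
    if claim == "superiority":
        if p_value is None:
            raise ValueError("superiority requires a P-value")
        # Direction check (hr < 1) is an assumption added for this sketch.
        return hr < 1 and p_value < 0.005
    if claim == "non_inferiority":
        return ci_upper < NON_INFERIORITY_MARGIN
    if claim == "equivalence":
        return ci_lower > EQUIVALENCE_LOWER and ci_upper < NON_INFERIORITY_MARGIN
    if claim == "harm":
        if p_value is None:
            raise ValueError("harm requires a P-value")
        return hr > 1 and ci_lower > 1 and p_value < 0.05
    raise ValueError(f"unknown claim type: {claim}")


# Hypothetical estimates, for illustration only.
print(meets_table2_criterion("non_inferiority", 1.02, 0.92, 1.13))
print(meets_table2_criterion("equivalence", 1.02, 0.85, 1.13))
print(meets_table2_criterion("harm", 1.30, 1.05, 1.61, p_value=0.016))
```

For continuous outcomes, the same pattern would apply with limits of −0.2 and +0.2 standard deviation units in place of 0.87 and 1.15.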


RCTs that form the basis of LOE A should be of high quality and free of major sources of bias that could have influenced their findings in ways that might alter their interpretation. Table 3 (A) summarizes characteristics of high-quality RCTs, all of which should usually be present, while Table 3 (B) summarizes characteristics of high-quality meta-analyses. Only RCTs should be summarized in a meta-analysis, excluding non-randomized studies, and the unexplained heterogeneity between RCTs should be low. Network meta-analyses should not be considered as the basis for LOE A (see Supplementary data online, Web-Appendix Section S5). Further guidance on RCTs and meta-analyses of RCTs is provided in Supplementary data online, Web-Appendix Sections S3 to S7.

Table 3

Characteristics of high-quality studies by study type

Domain | Characteristics that suggest high quality
(A) Randomized controlled trials
Patient allocation | Adequate concealment of allocation; no major baseline imbalances suggestive of selection bias.
Implementation and adherence | Appropriate control; few deviations from intended interventions in terms of implementation and adherence.
Outcome assessment | Outcome of interest unlikely to be influenced by knowledge of the intervention received by participants (i.e. objective outcome) or outcome assessors unaware of the intervention received by participants.
Outcome data | Low proportion of missing data on outcome of interest; reasons for missing data similar across intervention groups.
Reporting | Evidence that outcome of interest was analysed as pre-specified in the trial protocol before unblinded outcome data became available.
(B) Meta-analyses of randomized controlled trials
Study selection | Pre-defined protocol, including a thorough search strategy to identify all eligible RCTs, and inclusion and exclusion criteria.
Study quality | Risk of bias of included RCTs carefully evaluated; if a systematic review also included non-randomized studies, meta-analysis separately pooling results of RCTs only.
Study size | At least two adequately powered RCTs available for analysis; funnel plot for considered outcome approximately symmetrical, ruling out relevant small study effects.
Unexplained heterogeneity | I² for unexplained heterogeneity between trials usually <50% for considered outcome.
(C) Non-randomized studies of interventions
Patient selection | All eligible participants included; start of follow-up and start of intervention coincided in all participants; start of follow-up in control participants who did not receive an intervention defined comparably to start of follow-up in participants who received interventions.
Baseline characteristics | No confounding expected or all known important confounding factors appropriately measured in all or almost all participants and adequately controlled for.
Classification of interventions | Definition of interventions based only on information collected when interventions were started, not later.
Implementation, adherence, and co-interventions | Few or no deviations from intended interventions in terms of implementation and adherence; important co-interventions balanced across intervention groups.
Outcome assessment | Methods of outcome assessment comparable across intervention groups; outcome of interest unlikely to be influenced by knowledge of the intervention received by participants (i.e. objective outcome) or outcome assessors unaware of the intervention received by participants.
Data | Low proportion of missing data on confounding factors, interventions, and outcome of interest; reasons for missing data similar across intervention groups.
Reporting | Evidence, usually through examination of a pre-registered protocol or statistical analysis plan, that all reported results for an outcome of interest correspond to all intended analyses.

RCT, randomized controlled trial.

Considerations of characteristics of high-quality studies were based on RoB 2 for RCTs,6 ROBIS for meta-analyses,4 and ROBINS-I for non-randomized studies of interventions.12 See Supplementary data online, Web-Appendix Section S4 for discussion of placebo interventions in device trials, Supplementary data online, Web-Appendix Section S6 for a discussion of small study effects and funnel plots in meta-analyses, and Supplementary data online, Web-Appendix Section S7 for measures of heterogeneity in meta-analyses other than I².
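The heterogeneity threshold in Table 3 (B) refers to the I² statistic, which is conventionally derived from Cochran's Q. The sketch below shows that standard calculation for log-scale effect estimates and their standard errors. The function name `i_squared` and the example inputs are hypothetical, and the value it returns is the overall I²; whether heterogeneity counts as 'unexplained' depends on analyses (for example, by subgroup) that are not modelled here.

```python
def i_squared(log_effects, standard_errors):
    """I² (in percent) from Cochran's Q under inverse-variance weighting."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_effects))
    degrees_of_freedom = len(log_effects) - 1
    if q <= 0:
        return 0.0
    # Negative values are truncated at zero by convention.
    return max(0.0, (q - degrees_of_freedom) / q) * 100


# Hypothetical log hazard ratios and standard errors, for illustration only.
consistent = i_squared([-0.22, -0.16], [0.10, 0.08])
discordant = i_squared([-0.50, 0.00], [0.10, 0.10])
print(f"consistent trials: I² = {consistent:.0f}%")
print(f"discordant trials: I² = {discordant:.0f}%")
```

With only a few trials, I² is estimated imprecisely, which is one reason the source points to alternative measures in Web-Appendix Section S7.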

Table 3

Characteristics of high-quality studies by study type

DomainCharacteristics that suggest high quality
(A) Randomized controlled trials
Patient allocationAdequate concealment of allocation; no major baseline imbalances as evidence suggestive of selection bias.
Implementation and adherenceAppropriate control; few deviations from intended interventions in terms of implementation and adherence.
Outcome assessmentOutcome of interest unlikely to be influenced by knowledge of the intervention received by participants (i.e. objective outcome) or outcome assessors unaware of the intervention received by participants.
Outcome dataLow proportion of missing data on outcome of interest; reasons for missing data similar across intervention groups.
ReportingEvidence that outcome of interest was analysed as pre-specified in the trial protocol before unblinded outcome data became available.
(B) Meta-analyses of randomized controlled trials
Study selectionPre-defined protocol, including a thorough search strategy to identify all eligible RCTs, and inclusion and exclusion criteria.
Study qualityRisk of bias of included RCTs carefully evaluated; if a systematic review also included non-randomized studies, meta-analysis separately pooling results of RCTs only.
Study sizeAt least two adequately powered RCTs available for analysis; funnel plot for considered outcome approximately symmetrical, ruling out relevant small study effects.
Unexplained heterogeneityI2 for unexplained heterogeneity between trials usually <50% for considered outcome.
(C) Non-randomized studies of interventions
Patient selectionAll eligible participants included; start of follow-up and start of intervention coincided in all participants; start of follow-up in control participants who did not receive an intervention defined comparably to start of follow-up in participants who received interventions.
Baseline characteristicsNo confounding expected or all known important confounding factors appropriately measured in all or almost all participants and adequately controlled for.
Classification of interventionsDefinition of interventions based only on information collected when interventions were started, not later.
Implementation, adherence, and co-interventionsFew or no deviations from intended interventions in terms of implementation and adherence; important co-interventions balanced across intervention groups.
Outcome assessmentMethods of outcome assessment comparable across intervention groups; outcome of interest unlikely to be influenced by knowledge of the intervention received by participants (i.e. objective outcome) or outcome assessors unaware of the intervention received by participants.
DataLow proportion of missing data on confounding factors, interventions, and outcome of interest; reasons for missing data similar across intervention groups.
ReportingEvidence, usually through examination of a pre-registered protocol or statistical analysis plan, that all reported results for an outcome of interest correspond to all intended analyses.
Domain | Characteristics that suggest high quality
(A) Randomized controlled trials
Patient allocation | Adequate concealment of allocation; no major baseline imbalances as evidence suggestive of selection bias.
Implementation and adherence | Appropriate control; few deviations from intended interventions in terms of implementation and adherence.
Outcome assessment | Outcome of interest unlikely to be influenced by knowledge of the intervention received by participants (i.e. objective outcome) or outcome assessors unaware of the intervention received by participants.
Outcome data | Low proportion of missing data on outcome of interest; reasons for missing data similar across intervention groups.
Reporting | Evidence that outcome of interest was analysed as pre-specified in the trial protocol before unblinded outcome data became available.
(B) Meta-analyses of randomized controlled trials
Study selection | Pre-defined protocol, including a thorough search strategy to identify all eligible RCTs, and inclusion and exclusion criteria.
Study quality | Risk of bias of included RCTs carefully evaluated; if a systematic review also included non-randomized studies, meta-analysis separately pooling results of RCTs only.
Study size | At least two adequately powered RCTs available for analysis; funnel plot for considered outcome approximately symmetrical, ruling out relevant small study effects.
Unexplained heterogeneity | I² for unexplained heterogeneity between trials usually <50% for considered outcome.
(C) Non-randomized studies of interventions
Patient selection | All eligible participants included; start of follow-up and start of intervention coincided in all participants; start of follow-up in control participants who did not receive an intervention defined comparably to start of follow-up in participants who received interventions.
Baseline characteristics | No confounding expected or all known important confounding factors appropriately measured in all or almost all participants and adequately controlled for.
Classification of interventions | Definition of interventions based only on information collected when interventions were started, not later.
Implementation, adherence, and co-interventions | Few or no deviations from intended interventions in terms of implementation and adherence; important co-interventions balanced across intervention groups.
Outcome assessment | Methods of outcome assessment comparable across intervention groups; outcome of interest unlikely to be influenced by knowledge of the intervention received by participants (i.e. objective outcome) or outcome assessors unaware of the intervention received by participants.
Data | Low proportion of missing data on confounding factors, interventions, and outcome of interest; reasons for missing data similar across intervention groups.
Reporting | Evidence, usually through examination of a pre-registered protocol or statistical analysis plan, that all reported results for an outcome of interest correspond to all intended analyses.

RCT, randomized controlled trial.

Considerations of characteristics of high-quality studies were based on RoB 2 for RCTs,6 ROBIS for meta-analyses,4 and ROBINS-I for non-randomized studies of interventions.12 See Supplementary data online, Web-Appendix Section S4 for discussion of placebo interventions in device trials, Supplementary data online, Web-Appendix Section S6 for a discussion of small study effects and funnel plots in meta-analyses, and Supplementary data online, Web-Appendix Section S7 for measures of heterogeneity in meta-analyses other than I².
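As a rough illustration of the heterogeneity criterion in Table 3 (B), I² is commonly derived from Cochran's Q as I² = max(0, (Q - df)/Q) × 100%. The sketch below assumes inverse-variance weights on the log scale; the function name and the inputs are hypothetical and are not taken from the Web-Appendix.

```python
def i_squared(log_effects, std_errors):
    """Return (Cochran's Q, I-squared in percent) for trial-level log effect estimates."""
    # Inverse-variance weights: more precise trials count more.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    # Q sums the weighted squared deviations of each trial from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_effects))
    df = len(log_effects) - 1
    # I-squared is the share of Q in excess of its expectation (df) under homogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2
```

A value below 50% would be in line with the characteristic listed in the table, although I² is imprecise when only a few trials are pooled.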

The methodology taskforce recognizes that cluster RCTs are appropriate for evaluating whether hospital-wide or system-wide changes affect patient outcomes but recommends that cluster RCTs should only be considered as the basis for LOE A if they ensured adequate concealment of allocation. Cluster RCTs lacking adequate concealment of allocation—including many parallel-group cluster RCTs with inclusion of incident cases, crossover cluster RCTs without central identification of eligible patients, and most stepped-wedge cluster trials—should generally not be relied on for LOE A as they are prone to selection bias.16 A more in-depth discussion of this is beyond the scope of this document.

On rare occasions, a guideline taskforce may consider LOE A for recommendations that fall under one of the following two exceptions:

Exception 1: a single very large, well-performed RCT provides conclusive evidence

In rare cases, a single very large RCT can provide a definitive result that, on its own, is accepted by most practitioners as sufficient to warrant a major change in clinical practice. Additional adequately powered RCTs are therefore unlikely to be conducted as it may be considered unethical to conduct them. The RCT should provide substantial evidence against the play of chance and should be both of high quality and free of major sources of bias.

Exception 2: clinical effectiveness is obvious

In a few cases, a guideline taskforce may wish to create a recommendation with a LOE A classification for an intervention whose benefit is clinically obvious, that is pathophysiologically and pharmacologically consistent with current understanding, and that has the potential to significantly improve the prognosis of patients who would otherwise experience serious outcomes. The intervention has been widely accepted as effective and therefore no adequately powered RCTs have ever been conducted or are likely to be conducted for ethical reasons. Such interventions may or may not be supported indirectly by in vitro, animal, or Mendelian randomization studies. Examples include smoking cessation, cardiac pacing for acquired complete atrio-ventricular block, and defibrillation for ventricular fibrillation in cardiac arrest.

The methodology taskforce stresses that these exceptions should rarely be used and when used there should be agreement on the corresponding recommendation by >90% of the voting guideline taskforce members. A note should be included in the corresponding recommendation table to explain the rationale for the exception.

Level of evidence B1

Level of evidence B1 applies to evidence that is suggestive but not conclusive, either from multiple adequately powered RCTs if the combined findings of these RCTs in a high-quality meta-analysis provide some but not substantial evidence against the play of chance (e.g. P < .05 in case of superiority), or from a single adequately powered RCT with statistically significant results in case of superiority claims (P < .05). In analogy to superiority claims, the methodology taskforce suggests that somewhat wider 95% confidence intervals than those specified in Table 2 would be acceptable to support LOE B1 for non-inferiority or equivalence claims. The RCTs should be of high quality and therefore free of major sources of bias that might have influenced the findings meaningfully. Further research, if conducted, could conceivably result in changes of the recommendation.

Level of evidence B2

Level of evidence B2 applies to evidence that is limited because it is derived solely from high-quality non-randomized studies or from a meta-analysis of small, inadequately powered RCTs. Most claims of large effects arise from small RCTs or non-randomized studies and often turn out to be false on further scientific examination.8,17 These should therefore be critically assessed. However, the methodology taskforce acknowledges that, in selected cases, non-randomized studies can provide valuable information, particularly in case of rare diseases and long-term outcomes, which may not be as readily addressed in RCTs. LOE B2 may therefore be considered if multiple adequately powered non-randomized studies of interventions meet most or all the characteristics summarized in Table 3 (C) and provide substantial evidence against the play of chance (P < .005 in case of superiority) when either analysed individually or combined in a meta-analysis. When harm is observed in non-randomized studies of interventions, a less stringent threshold (P < .05) can be used as evidence against the play of chance (Table 2).

Since non-randomized studies are vulnerable to selection bias, confounding, and misclassification of interventions,12 the initial three characteristics in Table 3 (C) suggesting high quality concerning patient selection, measurement and adjustment for baseline characteristics, and classification of interventions must always be present in a non-randomized study of interventions used as the basis for LOE B2.

Comparative effectiveness analyses using large clinical databases or administrative registries often yield low P-values in favour of new interventions. However, they are frequently plagued by unmeasured confounding introduced by hospitals or medical professionals who selectively prescribe the latest drugs or implant the latest devices, as well as by patients who are interested in or can afford them. Since such unmeasured confounding cannot be adjusted for, such studies do not generally meet the quality standards specified in Table 3 (C) and should rarely be classified as LOE B2. Further guidance on patient selection, measurement and adjustment for baseline characteristics, and classification of interventions in non-randomized studies of interventions is provided in Supplementary data online, Web-Appendix Section  S8.

LOE B2 can also be based on a meta-analysis of multiple small and inadequately powered RCTs (see Supplementary data online, Table S1) with a statistically significant result (e.g. P < .05 for superiority). The meta-analysis should meet the first two and the last characteristic in Table 3 (B), concerning trial selection, trial quality, and unexplained heterogeneity between trials. Since the meta-analysis only includes small, underpowered RCTs, it is susceptible to small-study effects.

Level of evidence C

Level of evidence C is the lowest LOE and represents only preliminary evidence that does not meet the criteria for LOE A, B1, or B2. Such evidence may include a single, inadequately powered RCT, non-randomized studies, and/or expert consensus. LOE C implies that there is a need for adequately powered, well-performed RCTs to determine the true effects of the intervention.
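The four levels described above can be summarized, for superiority claims only, as a simple decision rule. The following is a deliberately simplified sketch: the class and field names are hypothetical, study quality is assumed to have been judged beforehand against Table 3, and the two exceptions for LOE A as well as non-inferiority, equivalence, and harm claims are ignored.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    """Hypothetical summary of the body of evidence behind one recommendation."""
    adequately_powered_rcts: int = 0   # high-quality RCTs free from major bias
    small_rcts_pooled: int = 0         # underpowered RCTs combined in a meta-analysis
    high_quality_nrs: int = 0          # adequately powered non-randomized studies
    p_value: float = 1.0               # two-sided P for superiority of the graded evidence

def level_of_evidence(ev: Evidence) -> str:
    # LOE A: at least two adequately powered RCTs with P < .005 when combined.
    if ev.adequately_powered_rcts >= 2 and ev.p_value < 0.005:
        return "A"
    # LOE B1: at least one adequately powered RCT (or a meta-analysis) with P < .05.
    if ev.adequately_powered_rcts >= 1 and ev.p_value < 0.05:
        return "B1"
    # LOE B2: multiple high-quality non-randomized studies with P < .005 ...
    if ev.high_quality_nrs >= 2 and ev.p_value < 0.005:
        return "B2"
    # ... or a meta-analysis of multiple small RCTs with P < .05.
    if ev.small_rcts_pooled >= 2 and ev.p_value < 0.05:
        return "B2"
    # LOE C: everything else, including a single small RCT or expert consensus.
    return "C"
```

In practice the grading remains a judgement of the guideline taskforce; a rule of this kind only makes the default thresholds explicit.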

Explanatory notes

Guidance on adequate power

If a treatment truly has only a moderate benefit, a small RCT can only attain statistical significance if it produces an effect estimate greater than the true effect purely by chance. Achieving statistical significance often leads to a higher likelihood of presentation and publication, which results in smaller published RCTs tending to report larger treatment effects than larger RCTs. This phenomenon is known as publication bias.18 Other reasons why smaller, underpowered RCTs tend to show larger treatment effects compared with larger, adequately powered RCTs include selective outcome reporting, and methodological limitations that are more common in smaller RCTs.19,20 To protect against such small-study effects, at least two adequately powered RCTs, which involve largely similar populations, are now required to substantiate LOE A for a recommendation. The information provided in Supplementary data online, Web-Appendix Section S1 serves as further guidance for the required number of accrued outcome events for binary outcomes (e.g. death) and the number of included patients for continuous outcomes (e.g. blood pressure) for a 2-arm trial to have adequate power to detect different magnitudes of treatment effects. This guidance does not apply to cluster RCTs as the number of outcome events required in cluster RCTs may need to be significantly larger compared to standard RCTs due to the correlation of outcomes among participants within clusters. Therefore, guideline taskforces should seek guidance from a methodologist with experience in cardiovascular RCTs when categorizing cluster RCTs based on their size.
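To give a sense of how the required number of outcome events grows as the anticipated effect shrinks, the sketch below uses Schoenfeld's approximation for a 2-arm trial with 1:1 allocation and a time-to-event outcome. This formula is an assumption chosen for illustration and is not necessarily the method behind the figures in the Web-Appendix; the function name and the default power of 90% are likewise hypothetical.

```python
import math
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.90):
    """Approximate total events needed to detect `hazard_ratio` (1:1 allocation)."""
    if hazard_ratio <= 0 or hazard_ratio == 1:
        raise ValueError("hazard_ratio must be positive and different from 1")
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # two-sided significance level
    z_beta = z(power)            # desired power
    # Schoenfeld: d = 4 * (z_alpha + z_beta)^2 / (ln HR)^2
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2)

# Smaller anticipated effects require many more events.
for hr in (0.70, 0.80, 0.90):
    print(hr, required_events(hr))
```

The calculation does not apply to cluster RCTs, for which the correlation of outcomes within clusters has to be accounted for.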

Guidance on assessing evidence against the play of chance

In the context of ESC guidelines, a two-sided P-value of <.005 is considered substantial evidence against the play of chance when claiming superiority of an intervention.10,11 Higher P-values are usually associated with an unacceptably high proportion of positive claims that turn out to be false, as explained in Supplementary data online, Web-Appendix Section S2.10 A two-sided P-value for superiority of .005 in frequentist statistics approximately corresponds to a Bayesian posterior probability of superiority of 99.75% if a minimally informative prior is used, while a two-sided P-value of .05 corresponds to an approximate posterior probability of 97.5%.21 It is important to note that, from a clinical perspective, 95% confidence intervals provide insight into the range of values within which the true effect may lie,22 but the upper limit of the 95% confidence interval should not be relied on to determine whether there is substantial evidence against the play of chance in claims of superiority (see Supplementary data online, Web-Appendix Section S2).
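The correspondence between P-values and posterior probabilities can be reproduced with a normal approximation: under a flat (minimally informative) prior, the posterior probability that the effect lies in the observed direction is approximately 1 - P/2. The sketch below assumes a log hazard ratio with a known standard error; the function name and inputs are hypothetical.

```python
from statistics import NormalDist

def p_value_and_posterior(log_hr, se_log_hr):
    """Return (two-sided P, posterior probability that HR < 1) under a flat prior."""
    std_normal = NormalDist()
    z = log_hr / se_log_hr
    two_sided_p = 2 * (1 - std_normal.cdf(abs(z)))
    # With a flat prior, the posterior of log HR is approximately N(log_hr, se^2),
    # so P(log HR < 0) = Phi(-z).
    posterior_benefit = std_normal.cdf(-z)
    return two_sided_p, posterior_benefit

# A z-score chosen so that the two-sided P-value is .005
z_005 = NormalDist().inv_cdf(1 - 0.005 / 2)
p, post = p_value_and_posterior(-z_005 * 0.1, 0.1)
print(round(p, 4), round(post, 4))
```

With an informative prior the two quantities no longer map onto each other in this simple way.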

For non-inferiority trials, substantial evidence against the play of chance is indicated by an upper limit of the two-sided 95% confidence interval that usually excludes 1.15 for unfavourable outcomes, such as the composite of cardiovascular death, myocardial infarction, or stroke on a relative scale (hazard ratio, risk ratio, rate ratio, odds ratio). For equivalence trials, the two-sided 95% confidence interval should exclude both 0.87 and 1.15 on a relative scale. For other outcomes that are less critical, such as the occurrence of chest pain, it might be considered acceptable to have somewhat wider confidence intervals. The taskforce acknowledges that regulators advise that non-inferiority margins should be based on historical evidence for the active comparator. A common practice in cardiovascular RCTs is to use 50% of the minimally possible benefit found for the active comparator. The minimally possible benefit is then defined by the upper limit of the 95% confidence interval of the treatment effect found for the active comparator. This approach results in high variability of margins, sometimes resulting in margins that are overly liberal, as was the case for novel anticoagulants, where it yielded a non-inferiority margin of 1.38 on a hazard ratio scale.23,24
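The default limits described above translate into two simple checks on the two-sided 95% confidence interval of a relative effect measure for an unfavourable outcome. This is a minimal sketch with hypothetical function names; it applies the usual limits of 1.15 and 0.87 and does not cover less critical outcomes, for which somewhat wider intervals may be acceptable.

```python
NON_INFERIORITY_LIMIT = 1.15  # upper limit for unfavourable outcomes (relative scale)
EQUIVALENCE_LOWER_LIMIT = 0.87

def supports_non_inferiority(ci_upper):
    """Upper limit of the two-sided 95% CI must exclude 1.15."""
    return ci_upper < NON_INFERIORITY_LIMIT

def supports_equivalence(ci_lower, ci_upper):
    """Two-sided 95% CI must exclude both 0.87 and 1.15."""
    return ci_lower > EQUIVALENCE_LOWER_LIMIT and ci_upper < NON_INFERIORITY_LIMIT

# Hypothetical hazard ratio 0.98 with 95% CI 0.89 to 1.08
print(supports_non_inferiority(1.08), supports_equivalence(0.89, 1.08))
```

Note that 1/1.15 is approximately 0.87, so the two limits are symmetric around 1 on the log scale.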

When addressing non-inferiority and equivalence in the context of ESC guidelines, confidence intervals should always be assessed on a relative scale, as confidence intervals of risk differences are unduly narrow if event rates are low (see Supplementary data online, Web-Appendix Section  S9). If the prior used in Bayesian RCTs was minimally informative, the limits of 0.87 and 1.15 can also be applied to 95% credibility intervals. When a superiority analysis comparing an active intervention with placebo or usual care yields non-significant results despite adequate power, guideline taskforce members should inspect the 95% confidence interval to determine whether it satisfies criteria of non-inferiority or equivalence of placebo or usual care compared to the active intervention for LOE A. Lastly, it is challenging to interpret P-values in non-inferiority and equivalence trials, as they hinge on the chosen margins to establish non-inferiority or equivalence. Therefore, they should not be used to determine the LOE in the context of ESC guidelines.
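The reason for preferring the relative scale can be seen in a small numerical sketch. The code below computes Wald-type 95% confidence intervals for a risk difference and a risk ratio from hypothetical event counts (12/1000 vs. 10/1000); both the counts and the function names are assumptions made for illustration and do not come from the Web-Appendix.

```python
import math

def risk_difference_ci(e1, n1, e0, n0, z=1.96):
    """Risk difference (group 1 minus group 0) with a Wald 95% CI."""
    p1, p0 = e1 / n1, e0 / n0
    se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    rd = p1 - p0
    return rd, rd - z * se, rd + z * se

def risk_ratio_ci(e1, n1, e0, n0, z=1.96):
    """Risk ratio with a 95% CI computed on the log scale (requires e1, e0 > 0)."""
    rr = (e1 / n1) / (e0 / n0)
    se_log = math.sqrt(1 / e1 - 1 / n1 + 1 / e0 - 1 / n0)
    log_rr = math.log(rr)
    return rr, math.exp(log_rr - z * se_log), math.exp(log_rr + z * se_log)

print(risk_difference_ci(12, 1000, 10, 1000))
print(risk_ratio_ci(12, 1000, 10, 1000))
```

With these counts the interval for the risk difference spans only about one percentage point in either direction and looks reassuringly narrow, whereas the upper limit of the risk ratio lies far above 1.15, so non-inferiority is not supported.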

To demonstrate substantial evidence against the play of chance in case of harm, the point estimate should suggest harm, and the two-sided P-value should usually be <.05. This is equivalent to the lower limit of the two-sided 95% confidence interval excluding 1 for an unfavourable outcome on a relative scale. The difference in thresholds for P-values required for claims of benefit vs. harm arises from the potential consequences of false positives in each scenario. When claiming benefit, the lower threshold of .005 is chosen to minimize the chance of falsely claiming a beneficial effect when it might not exist. When claiming harm, a higher threshold of .05 is used to minimize the chance of missing harmful effects.
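The equivalence between the P-value criterion and the confidence interval criterion for harm can be checked numerically. The sketch below assumes a hazard ratio for an unfavourable outcome with a known standard error on the log scale; the function name and the example values are hypothetical.

```python
import math
from statistics import NormalDist

def harm_assessment(hazard_ratio, se_log_hr):
    """Evaluate the harm criterion via the P-value and via the 95% CI lower limit."""
    std_normal = NormalDist()
    log_hr = math.log(hazard_ratio)
    z = log_hr / se_log_hr
    two_sided_p = 2 * (1 - std_normal.cdf(abs(z)))
    z_crit = std_normal.inv_cdf(0.975)
    ci_lower = math.exp(log_hr - z_crit * se_log_hr)
    return {
        "two_sided_p": two_sided_p,
        "ci_lower": ci_lower,
        # Point estimate suggests harm and two-sided P < .05
        "meets_p_criterion": hazard_ratio > 1 and two_sided_p < 0.05,
        # Lower limit of the two-sided 95% CI excludes 1
        "meets_ci_criterion": ci_lower > 1,
    }

print(harm_assessment(1.30, 0.10))
print(harm_assessment(1.10, 0.10))
```

Both criteria agree for any input, which is why either can be used to document evidence of harm.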

Composite outcomes that combine benefits and harms into a ‘net benefit’, such as net adverse clinical events,25 aim at providing an assessment of the balance between the benefits and harms of an intervention.26 Interpreting non-inferiority for a net benefit outcome on its own presents challenges because its clinical implications depend on the nature and clinical significance of the observed benefits and harms. In the context of ESC guidelines, achieving non-inferiority in net benefit should be accompanied by non-inferiority in both benefit and harm outcomes, both supported by substantial evidence against the play of chance as detailed in Table 2. If there is superiority in net benefit with substantial evidence against the play of chance, there are no additional requirements.

Class III recommendations typically pertain to interventions that are not recommended based on evidence suggesting that these interventions are likely to be ineffective or potentially harmful. This implies the presence of evidence indicating either equivalence or harm when compared with a control intervention. Therefore, when assessing the intervention of interest for a Class III recommendation, the criteria summarized in Table 2 should be used to establish substantial evidence against the play of chance in cases of equivalence or harm when compared with control.

Supplementary data online, Web-Appendix Section S10 provides guidance on how to deal with multiplicity if an outcome of interest of a recommendation is not the primary outcome of an RCT or a meta-analysis, and considerations for interpreting treatment effects in specific patient subgroups.

Other considerations

There will be situations where assigning LOE will be difficult; therefore, the guidelines must rely on taskforce members to determine appropriate classifications. Such challenges can arise from extrapolating evidence from individual drugs to drug classes or from trial populations to other, similar populations. Supplementary data online, Web-Appendix Section  S11 provides considerations regarding drug class effects. Rare diseases pose a particularly difficult situation for evidence development. Most rare diseases lack approved therapies, and patients may be hesitant to participate in RCTs with a placebo arm. In cases where an approved therapy exists, RCTs may require participants to discontinue that treatment, which patients may be reluctant to do. In severe or rapidly progressive diseases, using a placebo control may also raise ethical concerns. Even when placebo-controlled RCTs are feasible, they will inevitably involve small sample sizes. Therefore, regulatory authorities may in some cases accept other types of evidence, including high quality pharmacokinetic and pharmacodynamic studies. It was beyond the scope of this document to address such situations, and guideline taskforces should seek specific guidance in such cases.

Association between class of recommendation and level of evidence

The class of recommendation (COR) should generally align with the LOE, with COR I and III more frequently supported by LOE A than COR IIa and IIb. However, ESC guidelines also provide clinical guidance in situations with limited evidence. Especially in the context of rare diseases, conducting large RCTs that could attain LOE A can be challenging, as discussed above; yet, it may be appropriate to make a class I recommendation for certain interventions. The methodology taskforce acknowledges this but also encourages ESC guideline taskforces to limit the number of class I recommendations supported by LOE B2 or C if RCTs could be conducted.

The methodology taskforce notes that substantial evidence against the play of chance does not automatically justify a COR I. The decision to assign a COR I is based on a comprehensive assessment that also includes the absolute magnitude of the observed benefit, such as the number-needed-to-treat (NNT), the type of outcome, and the balance between benefit and harm of an intervention. In instances where there is substantial evidence against the play of chance, but the observed benefit is modest, it is the responsibility of the guideline taskforce to evaluate whether the observed effect is sufficiently clinically relevant and the benefit-risk balance sufficiently favourable to warrant a COR I, or whether COR IIa or IIb should be considered. Finally, although not directly a consequence of the LOE classification, it is important to emphasize that a given LOE only applies to the patient populations similar to those included in the studies that form the basis for the LOE classification.
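The role of the absolute magnitude of benefit can be illustrated with the NNT, which is the reciprocal of the absolute risk reduction. The sketch below uses hypothetical control-group risks and a hypothetical risk ratio to show that the same relative effect yields very different NNTs; the function name is an assumption.

```python
import math

def number_needed_to_treat(control_risk, risk_ratio):
    """NNT = 1 / absolute risk reduction, rounded up to a whole patient."""
    arr = control_risk * (1 - risk_ratio)
    if arr <= 0:
        raise ValueError("no absolute risk reduction: NNT is undefined")
    # Round before taking the ceiling to avoid floating-point overshoot.
    return math.ceil(round(1.0 / arr, 6))

# Same risk ratio of 0.80 applied to a high-risk and a low-risk population
print(number_needed_to_treat(0.10, 0.80))
print(number_needed_to_treat(0.01, 0.80))
```

A tenfold lower baseline risk gives a tenfold higher NNT, which may influence whether a COR I is warranted despite identical evidence against the play of chance.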

Key resources

The methodology taskforce acknowledges that it is beyond the scope of this document and its accompanying Supplementary data online, Web-Appendix to provide full guidance on assessing and interpreting different study designs for guideline recommendations on treatment and prevention. Supplementary data online, Web-Appendix Section  S12 therefore presents key resources for guideline taskforce members to consider when assessing evidence on treatment or prevention. In addition, trained methodologists should be included in every taskforce to guide evidence appraisal. As a key resource for guideline taskforce members, the guidance provided in the document and Supplementary data online, Web-Appendix will be modified and expanded as needed in the future.

Discussion

Since 2000, the ESC has used a three-tiered LOE grading system to assess the quality of evidence supporting the recommendations in all guidelines. While the grading system has served many ESC guidelines well, it also has important limitations that have led to uncertainty in its application to evidence. Recognizing these limitations, the ESC CPG Committee appointed a methodology taskforce to revise the LOE grading system. In phase 1, a new system for therapy and prevention was developed. The taskforce acknowledged the widespread acceptance of ESC's three-tiered system, which has been used in ESC guidelines for over 20 years. The newly proposed system therefore retains three levels of evidence, A to C, but subdivides LOE B into B1 and B2, where B1 connotes a higher LOE than B2.

The revised system provides clearer criteria for each LOE, considering factors such as adequate power, evidence against the play of chance, and the absence of major bias. Exceptions are made for cases where a single large RCT provides definitive evidence or when the clinical effectiveness of an intervention is obvious. The proposed system aims to improve the reliability and accuracy of guideline recommendations in cardiovascular medicine. Explanatory notes and the Supplementary data online, Web-Appendix provide additional information to enable guideline taskforce members to understand the rationale of the grading system and use it appropriately.

The revised LOE grading system for therapy and prevention will lead to downgrading of the evidence for some recommendations from LOE A to B1 or B2, and from B to B2 or C, whereas few recommendations will achieve a higher LOE. An impact assessment of recently developed guidelines performed by the methodology taskforce confirmed that the changes in grading were relatively few. A large proportion of LOE B or C supporting guideline recommendations is often perceived as a weakness. However, it is essential to understand that this is not a shortcoming of the guidelines themselves but rather a reflection of the limitations of the underlying evidence. These limitations become more evident when applying the revised LOE grading system.

Limitations of the revised level of evidence grading system

The revision focused on traditional interventions, including drugs, devices, and procedures. It does not address new interventions such as gene therapies, which may address the genetic cause of a disease. As more experience is gained with this type of intervention, the grading system might need to be adapted.

The methodology taskforce acknowledges that studies sometimes can offer compelling evidence for a particular intervention, even if they do not meet the specific criteria outlined for LOE A or B1. While considering how to accommodate such scenarios within a grading system, the taskforce prioritized simplicity, opting to maintain the familiar three-tier classification system widely recognized by cardiologists. Furthermore, the taskforce emphasized stringent criteria, recognizing that this approach might sometimes result in assigning a lower LOE than what could be inferred from the overall body of evidence. The taskforce preferred to err on the side of caution rather than assigning an inappropriately high LOE to a recommendation. These priorities were closely linked to the need for a grading system that can be relatively easily used by clinical experts who contribute to the development of ESC clinical practice guidelines.

It is important to note that the LOE provided here, along with their definitions and supplementary background information, may not offer a clear classification in every instance. There will be areas open to interpretation, which is an inherent challenge. While the class of recommendation should generally align with the LOE, other considerations can and should be factored in. Each guideline’s taskforce will need to offer recommendations and guidance based on their best assessment of the evidence and their understanding of clinical realities. The ESC guideline development process, which involves the careful selection of experts for each taskforce and a rigorous review process, ensures that the final recommendations and corresponding LOE are robust and well-informed.

In conclusion, the ESC has recognized the limitations of the current LOE grading system and developed a new standardized approach to evaluate the strength of evidence for therapy and prevention in their guidelines. The proposed system provides more specific criteria for each LOE, taking into account the quality and size of studies, evidence against the play of chance, and the presence of bias.

Acknowledgements

Members of the ESC Clinical Practice Guidelines Committee and a selection of external expert reviewers were invited to review this document. Reviewers who agreed to be acknowledged for their contribution are: Deepak L. Bhatt, Louise Bowman, Robert A. Byrne, Christopher Cannon, Victoria Delgado, Robert P. Giugliano, Robert Harrington, John P. A. Ioannidis, Juan Pablo Kaski, Lars Køber, Teresa López-Fernández, Borja Ibanez, Alexander R. Lyon, Nikolaus Marx, John William McEvoy, John J. V. McMurray, Julinda Mehilli, Marco Metra, Borislava Mihaylova, Richard Mindham, Agnes A. Pasquet, Amina Rakisheva, Bianca Rocca, Marc S. Sabatine, Henrik Toft Sorensen, Jacob Tfelt-Hansen, Christiaan J. M. Vrints, Adam Witkowski, Faiez Zannad, Katja Zeppenfeld.

Supplementary data

Supplementary data are available at European Heart Journal online.

Declarations

Disclosure of Interest

S.A. reports receiving speaker honoraria from Novartis, Daiichi, Astra Zeneca, BMS, and Pfizer. E.A. reports payments or honoraria from Biosense Webster, Medtronic, and Bristol-Myers-Squibb. E.A. is a member of the Board of the European Heart Rhythm Association. S.B. reports that he is a DSMB member of the EVERGREEN Study – Investigator Initiated study (NCT05479305) and a DSMB statistician of the IONMAN Study – (NCT06071702). M.C. reports investigator-initiated research grants to her institution from Novartis, Abbott, and Pfizer. M.C. reports clinical study contracts between her institution and NovoNordisk and CorVia. M.C. receives support for attending meetings from Abbott, NovoNordisk, Pfizer, and Teva Pharmaceutical Industries. M.C. declares payment or honoraria for lectures, presentations, speakers' bureaus, manuscript writing, or educational events from Abbott, Abiomed, Amgen, Amicus Therapeutics, Astra Zeneca, Bayer, Boehringer-Ingelheim, GE Healthcare, Krka Pharma, Novartis, NovoNordisk, Pfizer, Roche, Swixx, Takeda, Teva Pharmaceutical Industries, and Viatris. M.C. declares participation on Bristol Myers Squibb, NovoNordisk, Pulsify Medical Data Safety Monitoring Boards or Advisory Boards. L.F. reports consulting fees from Bayer, BMS/Pfizer, Boehringer Ingelheim, Medtronic, Novartis, Novo, and XO and speaker fees from AstraZeneca, Bayer, BMS/Pfizer, Boehringer Ingelheim, Medtronic, Novartis, and Zoll. C.P.G. reports grants from or contracts with Alan Turing Institute, British Heart Foundation, National Institute for Health Research, Horizon 2020, Abbott Diabetes, Bristol Myers Squibb, and the European Society of Cardiology. C.P.G. receives consulting fees from AI Nexus, AstraZeneca, Amgen, Bayer, Bristol Myers Squibb, Boehringer-Ingelheim, CardioMatics, Chiesi, Daiichi Sankyo, GPRI Research B.V., Menarini, Novartis, iRhythm, Organon, The Phoenix Group and payments or honoraria from AstraZeneca, Boston Scientific, Menarini, Novartis, Raisio Group, Wondr Medical, Zydus, as well as support for attending meetings from AstraZeneca. C.P.G. reports participation on a Data Safety Monitoring Board or Advisory Board on the DANBLOCK trial and the TARGET CTCA trial. C.P.G. declares his roles as deputy Editor: EHJ Quality of Care and Clinical Outcomes, NICE Indicator Advisory Committee Member, and Chair of the ESC Quality Indicator Committee. C.P.G. reports stock or stock options in CardioMatics and receipt of equipment, materials, drugs, medical writing, gifts, or other services from Kosmos device. S.H. reports receiving speaker honoraria from Novartis, Sanofi, Bristol-Myers Squibb, and Astra Zeneca for topics not related to the content of this manuscript. S.J. declares proctor fees from Medtronic and payments or honoraria to his institution from Novo Nordisk and Amgen. S.J. declares participation on the New Amsterdam Data Safety Monitoring or Advisory Board with compensation to his institution. K.C.K. reports payments or honoraria from Amgen, Sanofi, Daiichi Sankyo. D.K. reports the following grants to his institution: National Institute for Health Research (NIHR130280 DaRe2THINK; NIHR132974 D2T-NeuroVascular; NIHR203326 Biomedical Research Centre), British Heart Foundation (AA/18/2/34218 and FS/CDRF/21/21032), EU/EFPIA Innovative Medicines Initiative (BigData@Heart 116074), EU Horizon and UKRI (HYPERMARKER 101095480), UK National Health Service -Data for R&D- Subnational Secure Data Environment programme (West Midlands), UK Department for Business, Energy & Industrial Strategy Regulators Pioneer Fund, Cook & Wolstenholme Charitable Trust, Bayer, Amomed, Protherics Medicines Development. D.K. reports educational grants to the European Society of Cardiology from Boehringer Ingelheim/BMS-Pfizer Alliance/Bayer/Daiichi Sankyo/Boston Scientific, the NIHR/University of Oxford Biomedical Research Centre and British Heart Foundation/University of Birmingham Accelerator Award (STEEER-AF). D.K. reports consulting fees: Bayer, Advisory board; Amomed, Advisory board; Protherics Medicines Development, Advisory board. D.K. declares participation in the ATEMPT trial, Data Safety Monitoring Board (no payments) and COLOUR-COPD, Trial Steering Board (no payments). D.K. reports being Chair of the European Society of Cardiology 2024 AF Guidelines (no payments), and Chair of the Heart Failure Association Atrial Disorders Committee (no payments). U.L. reports research grants with payment to his institution from Abbott, Bayer, Novartis, consulting fees with payment to his institution from Amgen and Sanofi, and payments or honoraria with payment to his institution from Novartis. B.S.L. reports consulting fees from Janssen R&D and Idorsia. M.-L.L. reports lecture fees from Bayer, Sanofi and BMS/Pfizer. J.C.N. reports his unpaid role as chair of the DSMB in the PROTECT-HF trial and his role as executive editor for Europace. E.B.P. reports participation on DSMB, Research Services, University of Oxford (payment to institution). C.B. reports grants or contracts from Boehringer Ingelheim, which provides support through a grant to the University of Oxford for the EMPA-KIDNEY trial, from the Medical Research Council for his position as Director of the Population Health Research Unit 2019-24 and the Therapy Acceleration Laboratory Award 2021, from the NIHR HTA for 17/140/02: cost-effectiveness of statin therapies evaluated using individual participant data from large randomised clinical trials (co-applicant) 2019-22 and from the Health Data Research UK for the Substantive Site award (co-applicant); 2018-23. C.B. participated as chair in the Phase II trial of Factor X1-Inhibitor for Dialysis Patients (unpaid work) supported by Merck and as chair of the Prepare for Kidney Care trial (unpaid work) supported by NIHR HTA. P.J., B.R.d.C., X.R., and I.V. have nothing to declare.

Data Availability

No data were generated or analysed for this manuscript.

Funding

All authors declare no funding for this contribution.

References

1. Klein WW. Current and future relevance of guidelines. Heart 2002;87:497–500.

2. Bertrand ME, Simoons ML, Fox KA, Wallentin LC, Hamm CW, McFadden E, et al. Management of acute coronary syndromes: acute coronary syndromes without persistent ST segment elevation; recommendations of the task force of the European Society of Cardiology. Eur Heart J 2000;21:1406–32.

3. Tantawy M, Marwan M, Hussien S, Tamara A, Mosaad S. The scale of scientific evidence behind the current ESC clinical guidelines. IJC Heart Vasc 2023;45:101175.

4. Whiting P, Savović J, Higgins JPT, Caldwell DM, Reeves BC, Shea B, et al. ROBIS: a new tool to assess risk of bias in systematic reviews was developed. J Clin Epidemiol 2016;69:225–34.

5. Jüni P, Altman DG, Egger M. Systematic reviews in health care: assessing the quality of controlled clinical trials. BMJ 2001;323:42–6.

6. Sterne JAC, Savović J, Page MJ, Elbers RG, Blencowe NS, Boutron I, et al. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ 2019;366:l4898.

7. Yusuf S, Collins R, Peto R. Why do we need some large, simple randomized trials? Stat Med 1984;3:409–22.

8. Egger M, Davey Smith G. Misleading meta-analysis. Lessons from “an effective, safe, simple” intervention that wasn’t. BMJ 1995;310:752–4.

9. Thompson SG. Why sources of heterogeneity in meta-analysis should be investigated. BMJ 1994;309:1351–5.

10. Sterne JA, Davey Smith G. Sifting the evidence-what’s wrong with significance tests? BMJ 2001;322:226–31.

11. Ioannidis JPA. The proposal to lower P value thresholds to .005. JAMA 2018;319:1429.

12. Sterne JA, Hernán MA, Reeves BC, Savović J, Berkman ND, Viswanathan M, et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ 2016;355:i4919.

13. Collins R, Bowman L, Landray M, Peto R. The magic of randomization versus the myth of real-world evidence. N Engl J Med 2020;382:674–8.

14. VanderWeele TJ, Ding P. Sensitivity analysis in observational research: introducing the E-value. Ann Intern Med 2017;167:268–74.

15. EAFT (European Atrial Fibrillation Trial) Study Group. Secondary prevention in non-rheumatic atrial fibrillation after transient ischaemic attack or minor stroke. Lancet 1993;342:1255–62.

16. Giraudeau B, Ravaud P. Preventing bias in cluster randomised trials. PLoS Med 2009;6:e1000065.

17

Lawlor
 
DA
,
Davey Smith
 
G
,
Kundu
 
D
,
Bruckdorfer
 
KR
,
Ebrahim
 
S
.
Those confounded vitamins: what can we learn from the differences between observational versus randomised trial evidence?
 
Lancet
 
2004
;
363
:
1724
7
.

18

Easterbrook
 
PJ
,
Berlin
 
JA
,
Gopalan
 
R
,
Matthews
 
DR
.
Publication bias in clinical research
.
Lancet
 
1991
;
337
:
867
72
.

19

Egger
 
M
,
Davey Smith
 
G
,
Schneider
 
M
,
Minder
 
C
.
Bias in meta-analysis detected by a simple, graphical test
.
BMJ
 
1997
;
315
:
629
34
.

20

Nüesch
 
E
,
Trelle
 
S
,
Reichenbach
 
S
,
Rutjes
 
AW
,
Tschannen
 
B
,
Altman
 
DG
, et al.  
Small study effects in meta-analyses of osteoarthritis trials: meta-epidemiological study
.
BMJ
 
2010
;
341
:
c3515
.

21

Greenland
 
S
.
Re: “P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate”
.
Am J Epidemiol
 
1994
;
139
:
116
7
.

22

Burton
 
PR
,
Gurrin
 
LC
,
Campbell
 
MJ
.
Clinical significance not statistical significance: a simple Bayesian alternative to p values
.
J Epidemiol Community Health
 
1998
;
52
:
318
23
.

23

Althunian
 
TA
,
de Boer
 
A
,
Groenwold
 
RHH
,
Klungel
 
OH
.
Defining the noninferiority margin and analysing noninferiority: an overview
.
Br J Clin Pharmacol
 
2017
;
83
:
1636
42
.

24

Granger
 
CB
,
Alexander
 
JH
,
McMurray
 
JJV
,
Lopes
 
RD
,
Hylek
 
EM
,
Hanna
 
M
, et al.  
Apixaban versus warfarin in patients with atrial fibrillation
.
N Engl J Med
 
2011
;
365
:
981
92
.

25

Stone
 
GW
,
Witzenbichler
 
B
,
Guagliumi
 
G
,
Peruga
 
JZ
,
Brodie
 
BR
,
Dudek
 
D
, et al.  
Bivalirudin during primary PCI in acute myocardial infarction
.
N Engl J Med
 
2008
;
358
:
2218
30
.

26

Sawaya
 
GF
,
Guirguis-Blake
 
J
,
LeFevre
 
M
,
Harris
 
R
,
Petitti
 
D
.
Update on the methods of the U.S. Preventive services task force: estimating certainty and magnitude of net benefit
.
Ann Intern Med
 
2007
;
147
:
871
5
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights)
