Efficacy of experimental treatments compared with standard treatments in non-inferiority trials: a meta-analysis of randomized controlled trials

Background There is concern that non-inferiority trials might be deliberately designed to conceal that a new treatment is less effective than a standard treatment. In order to test this hypothesis we performed a meta-analysis of non-inferiority trials to assess the average effect of experimental treatments compared with standard treatments. experimental treatments. Further studies are required to examine the importance of such bias.


Introduction
Non-inferiority trials are increasingly published in the medical literature, increasingly used in drug licensing and have at the same time come under increased scrutiny and criticism, up to the allegation that they are unethical. [1][2][3][4][5][6][7][8][9][10][11][12][13] A verdict of 'non-inferiority' leaves readers with the impression that a new experimental treatment is as good as an established standard treatment and that the two can be used interchangeably. However, in such trials, non-inferiority is statistically accepted whenever an experimental treatment is unlikely to be worse than an established treatment by more than a pre-specified amount, the so-called non-inferiority margin. If a relatively wide margin is chosen, new treatments that are actually less beneficial might wrongly be considered as equally effective. This may lead to acceptance and use of new therapies that are actually less effective in a clinically relevant way. 10,12 There is concern that non-inferiority trials might be deliberately designed to conceal that a new treatment is somewhat less effective than a standard treatment. 10,12 Systematic use of too-large non-inferiority margins or systematic biases of design, conduct or reporting of non-inferiority trials may skew results in favour of new treatments. [14][15][16][17][18][19] In this meta-analysis we examined one type of systematic bias. If trialists systematically compare slightly less effective new treatments with standard treatments, the combined results from a meta-analysis of many trials in which experimental treatments gain a verdict of non-inferiority would be expected to favour the standard treatment. In order to test this hypothesis, we performed a meta-analysis of non-inferiority trials published in clinical journals and assessed the average effect of experimental treatments compared with standard treatments.
Importantly, the combined estimate from the meta-analysis will not be influenced by the choice of the non-inferiority margins.

Eligibility and search strategy
We searched for non-inferiority trials on 20 February 2009 using PubMed (National Library of Medicine) with the text words 'noninferiority' or 'non inferiority' or 'equivalence' combined with the text words 'clinical trial' or 'trial' or 'trials' or 'study' or 'studies', limiting the search to publications from 1991 onwards. The initial search was restricted to six general medicine journals (Annals of Internal Medicine, BMJ, JAMA, Lancet, New England Journal of Medicine and PLoS Medicine). In a second step the search was extended to include the other 115 journals included in PubMed's selection of core clinical journals (see http://www.nlm.nih.gov/bsd/aim.html for a list of these journals). Although equivalence trials (trials that specify both a lower and an upper equivalence margin) 20,21 were not eligible for inclusion, we included the term 'equivalence' in our search strategy in order not to miss non-inferiority trials that had been described as equivalence trials.

Study selection
All two-arm parallel group non-inferiority trials of an experimental treatment compared with standard therapy were included, independent of the intervention examined in the trial. Articles that were published electronically ahead of print were also considered.

Data extraction
The following information was extracted independently by two investigators (D.S. and R.M.): year of publication, journal, subject area (cardiovascular medicine, infectious diseases, obstetrics and gynaecology, rheumatology, surgery or other), primary endpoint, non-inferiority margin for primary endpoint, expected incidence of the primary endpoint in the standard arm and the point estimate for the comparison of experimental with standard treatment. The primary endpoint was classified into three categories: (i) mortality alone or as part of a combined endpoint; (ii) clinical disease; and (iii) surrogate endpoint (imaging or laboratory test). Disagreements were resolved in consultation with a third investigator (O.D.). If trials presented more than one primary endpoint, the endpoint for which the sample size had been calculated was used. If it was unclear for which endpoint sample size calculations were done, or if no such calculations were reported, one of the primary endpoints was randomly selected. In trials that included several non-inferiority comparisons using the same standard treatment, e.g. when testing two dosages of an experimental therapy, one comparison was selected at random and included in the analysis. If a study reported both intention-to-treat and per-protocol analyses, the result used by the author to determine whether the intervention was non-inferior was extracted. If this was not clear, the per-protocol results were used. The funding source was independently classified by two investigators (D.S. and O.D.) as industry, public or mixed funding. The provision of study drugs by industry was considered as a source of industry funding.

Data synthesis and analysis
The confidence intervals (CIs) and non-inferiority margins reported by the investigators were used to classify the results as superior, non-inferior, inconclusive or inferior according to the definitions given in the extension of the Consolidated Standards of Reporting Trials (CONSORT) statement to non-inferiority trials. 9 Superiority was assumed if the experimental treatment was significantly (P < 0.05) more efficacious than the standard treatment. Non-inferiority was assumed if the 95% CI did not include the non-inferiority margin and lay entirely on the side of the margin favouring the experimental treatment. Results were classified as inconclusive if the 95% CI included the non-inferiority margin. Treatments were assumed to be inferior if the entire 95% CI lay beyond the non-inferiority margin, i.e. was worse than the margin.
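The classification rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `classify_result` is hypothetical, and the sketch assumes that the result is a ratio for which values above 1 favour the standard treatment and that the margin is greater than 1. Superiority at P < 0.05 is approximated here by the upper limit of the 95% CI lying below 1.

```python
def classify_result(ci_low, ci_high, margin):
    """Classify a comparison from the 95% CI of a ratio and the margin.

    Assumptions (not taken from the article): ratios > 1 favour the
    standard treatment, and the non-inferiority margin is > 1.
    """
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if ci_high < 1.0:
        # Whole interval favours the experimental treatment.
        return "superior"
    if ci_high < margin:
        # Interval excludes the margin on the favourable side.
        return "non-inferior"
    if ci_low > margin:
        # Whole interval lies beyond the margin.
        return "inferior"
    # Interval includes the margin.
    return "inconclusive"


if __name__ == "__main__":
    for ci in [(0.70, 0.95), (0.85, 1.10), (0.95, 1.30), (1.25, 1.60)]:
        print(ci, classify_result(ci[0], ci[1], margin=1.2))
```

The order of the checks matters: a result that is superior is also non-inferior, so superiority is tested first.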
Results of comparisons were expressed as ratio measures, which we call relative risks (RRs) throughout this article. If the trial reported risk ratios or hazard ratios (HRs) or odds ratios (ORs) from statistical models these were used in the analyses. For trials that reported risk differences, we calculated the risk ratio. For studies reporting continuous endpoints (e.g. blood pressure), results were converted to ORs using the method described by Hasselblad and Hedges, 22 and the OR was then used in further calculations. This method assumes logistic distributions with equal variances in the two treatment groups. Under this assumption the natural logarithm of the OR equals a constant multiplied by the standardized difference between means. If needed, the inverse of the RR was calculated, so that ratios >1 consistently favoured the standard treatment and ratios <1 favoured the experimental treatment. The RRs from individual studies were combined using random-effects models. In addition to combining RRs for all studies, we performed stratified meta-analyses according to whether results were interpreted as inferior, non-inferior or superior, by type of effect measure, by type of endpoint, by source of funding, by journal and according to whether the judgement of the result was based on an intention-to-treat analysis or not. In a random-effects meta-regression model we analysed the influence of the source of funding.
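The conversion of continuous endpoints and the pooling step can be illustrated as follows. This is a sketch under stated assumptions, not the authors' implementation. The constant in the Hasselblad and Hedges conversion is pi divided by the square root of 3. The article says only that random-effects models were used; the DerSimonian-Laird estimator below is an assumed choice, and the function names `smd_to_log_or` and `pool_random_effects` are hypothetical.

```python
import math


def smd_to_log_or(smd, se_smd):
    """Convert a standardized mean difference to ln(OR) and its SE.

    Uses ln(OR) = (pi / sqrt(3)) * SMD, which holds under logistic
    distributions with equal variances in the two groups.
    """
    k = math.pi / math.sqrt(3)
    return k * smd, k * se_smd


def pool_random_effects(log_rr, se):
    """Pool log ratios with the DerSimonian-Laird estimator (assumed).

    Returns (pooled RR, lower 95% limit, upper 95% limit, tau squared).
    """
    w = [1.0 / s ** 2 for s in se]              # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, log_rr)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_rr))
    df = len(log_rr) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (s ** 2 + tau2) for s in se]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, log_rr)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return (
        math.exp(pooled),
        math.exp(pooled - 1.96 * se_pooled),
        math.exp(pooled + 1.96 * se_pooled),
        tau2,
    )


if __name__ == "__main__":
    # Illustrative inputs only, not data from the included trials.
    rr, low, high, tau2 = pool_random_effects(
        [math.log(0.95), math.log(1.05), math.log(1.00)], [0.05, 0.08, 0.04]
    )
    print(f"pooled RR {rr:.3f} (95% CI {low:.3f}-{high:.3f}), tau2 {tau2:.4f}")
```

Pooling is done on the log scale because ratios are symmetric there: an RR of 2 and an RR of 0.5 with equal precision combine to an RR of 1.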
Pre-specified non-inferiority margins were also expressed as RRs. Margins that were reported as a difference in incidence were converted to RR by dividing the expected incidence of the primary endpoint in the standard treatment arm plus (for morbidity or mortality) or minus (for beneficial endpoints) the pre-specified margin by the expected incidence of the primary endpoint in the standard treatment arm. For example, if the expected mortality rate in the standard treatment arm was 10% and the pre-specified margin was set at 2%, the margin converted to an RR of 1.2 [(10 + 2)/10]. Margins could not be expressed as RR for studies that did not report the expected incidence, or studies reporting continuous endpoints. We examined the median and the distribution of non-inferiority margins and examined whether margins differed across the subgroups of trials mentioned above. We compared the observed incidence of the primary endpoint in the group that received standard treatment with the expected incidence, as specified by the trialists. The result was expressed as a ratio. If needed, the inverse of this ratio was calculated, so that ratios <1 consistently indicated that the standard treatment performed better than was expected at the design stage of the trial, and ratios >1 indicated that the standard treatment performed worse than was expected. This ratio could not be calculated for studies that did not report separately the expected, or the observed incidence of the primary endpoint, or studies reporting continuous endpoints.
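The two calculations described above can be written out as a short sketch. The function names and the `harmful` flag are hypothetical and are used only for illustration. The flag distinguishes morbidity or mortality endpoints from beneficial endpoints.

```python
def margin_to_rr(expected_incidence, margin_difference, harmful=True):
    """Convert a margin given as a difference in incidence to an RR.

    harmful=True  (morbidity or mortality): (expected + margin) / expected
    harmful=False (beneficial endpoint):    (expected - margin) / expected
    """
    if expected_incidence <= 0:
        raise ValueError("expected incidence must be positive")
    if harmful:
        return (expected_incidence + margin_difference) / expected_incidence
    return (expected_incidence - margin_difference) / expected_incidence


def observed_expected_ratio(observed, expected, harmful=True):
    """Ratio of observed to expected incidence in the standard arm.

    Inverted for beneficial endpoints, so that a ratio below 1 always
    means the standard treatment performed better than expected.
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("incidences must be positive")
    return observed / expected if harmful else expected / observed


if __name__ == "__main__":
    # Example from the text: expected mortality 10%, margin 2%.
    print(round(margin_to_rr(10, 2), 3))  # 1.2
```

Both functions work with percentages or proportions, provided the same unit is used for every argument.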

Literature search and study characteristics
We identified 532 potentially eligible articles and excluded 362 studies for the reasons shown in Figure 1. A total of 170 studies, which were published in 43 different journals, were included (see Appendix Tables 1, 2 and 3 for bibliographic details, available as supplementary data at IJE online). Five articles reported the results for two separate comparisons. In total, 175 comparisons were therefore included in the analyses. The oldest non-inferiority study in our selection dates from 1993. 23 Seventy-eight percent of included studies date from 2004 onwards, reflecting an increase in non-inferiority trials in the past 5 years.
For 130 comparisons (74%), we considered the experimental treatment to be non-inferior according to the published criteria. 9 Of note, in 6 of these 130 comparisons the authors deemed the experimental treatment to be clinically inferior based on a secondary endpoint. For 27 comparisons (15%) results were inconclusive, for 15 comparisons (9%) superior and for 3 comparisons (2%) inferior. In 20 instances our assessment differed from the authors' verdict: in 9 instances we judged the result to be superior where the authors' verdict was non-inferior, in 6 instances to be inconclusive as opposed to inferior and once to be non-inferior instead of inferior. The authors' verdict was more favourable to the experimental treatment in four comparisons, each time judging the result to be non-inferior instead of inconclusive.

Meta-analysis
The funnel plot showed a symmetrical distribution of results around RR 1 (Figure 3). Forty-seven percent of comparisons (n = 82) had a point estimate >1 (favouring standard treatment) and 53% (n = 93) <1 (favouring experimental treatment). Of the 130 comparisons judged to be non-inferior, the point estimate favoured experimental treatment in 58% (n = 76) and standard treatment in 42% (n = 54). The combined RR for all 175 comparisons was 0.994 (95% CI 0.978-1.010) using a random-effects model and 1.002 (95% CI 0.966-1.008) using a fixed-effects model. The combined RR for comparisons judged to be non-inferior was 0.995 (95% CI 0.983-1.006). Table 2 shows stratified random-effects meta-analyses according to trial result, measure of effect, type of endpoint, source of funding, by two journal strata and according to whether the judgement of the result was based on an intention-to-treat analysis or not. Using a random-effects model, the combined estimate for trials funded by industry was 0.978 (95% CI 0.956-1.000). The combined result for trials funded by public sources was 1.008 (95% CI 0.980-1.038). These two estimates did not differ significantly (P = 0.15 from random-effects meta-regression). All meta-analyses were also performed using a fixed-effects model and are presented in Appendix Table 4 available as supplementary data at IJE online. The main result and the results from the stratified analyses were similar for the random- and fixed-effects meta-analyses except for a difference in the result stratified by funding source.
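As a rough plausibility check on the comparison by funding source, two pooled estimates can be compared using only their reported 95% CIs. The sketch below is a simple z-test on the log scale. It is not the random-effects meta-regression used in the article and need not reproduce the reported P value. The function names are hypothetical, and the standard errors are recovered from the CI width on the assumption that the intervals are symmetric on the log scale.

```python
import math


def se_from_ci(low, high):
    """Standard error of ln(RR), recovered from a 95% CI."""
    return (math.log(high) - math.log(low)) / (2 * 1.96)


def compare_estimates(rr_a, ci_a, rr_b, ci_b):
    """Two-sided z-test for a difference between two independent log RRs.

    Returns (z, p).
    """
    diff = math.log(rr_b) - math.log(rr_a)
    se_diff = math.sqrt(se_from_ci(*ci_a) ** 2 + se_from_ci(*ci_b) ** 2)
    z = diff / se_diff
    # Two-sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


if __name__ == "__main__":
    # Estimates reported in the text: industry versus public funding.
    z, p = compare_estimates(0.978, (0.956, 1.000), 1.008, (0.980, 1.038))
    print(f"z = {z:.2f}, two-sided P = {p:.2f}")
```

Like the article's own analysis, this check does not show a significant difference at the 5% level between the two funding strata.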

Non-inferiority margins
The margin was expressed as an RR for 33 comparisons and could be converted from a risk difference to an RR for 91 comparisons. The median pre-specified non-inferiority margin was 1.
[Figure 1. Reasons for exclusion include: not a non-inferiority or equivalence trial (n = 53); symmetrical two-sided equivalence margin (n = 50); trial of a diagnostic tool (n = 10); data missing (n = 7); three study arms or more (n = 4); effect measure cannot be converted to an RR (n = 5); standard error equals zero (n = 2); sub-analysis of a trial that had already been included (n = 1).]
The ratio of the observed and expected incidence of the primary endpoint in the group that received standard treatment could be calculated for 112 comparisons. Fifty-three percent of comparisons (n = 59) had a ratio <1, indicating that the standard treatment performed better than was expected at the design stage of the trial and 46% (n = 51) >1, indicating that the standard treatment performed worse than was expected. Two ratios were exactly 1. The mean ratio was 0.941 (95% CI 0.859-1.030), meaning that on average the chosen standard treatments performed slightly better than was estimated at the design stage of the trials. Stratified by source of funding, this ratio was 0.974 (95% CI 0.865-1.097) for 55 studies funded by industry and 0.906 (95% CI 0.760-1.080) for 31 studies funded by public sources. These two estimates did not differ significantly (P = 0.5 from t-test for equality of means).

Discussion
In this meta-analysis of trials using a non-inferiority design, experimental treatments were regarded as non-inferior to standard treatments in the majority of studies. The combined RR for these studies comparing experimental with standard treatments was close to 1. For non-inferiority trials published in core clinical journals, this finding contradicts the hypothesis that new treatments that gain a verdict of non-inferiority are systematically less effective than standard treatments.
Our study has several strengths and limitations. We aimed to include all the non-inferiority trials published in these journals, irrespective of the type of endpoints or measures of effect. We restricted the search to the group of core clinical journals, as defined in PubMed, which is the same group of journals as in the Abridged Index Medicus (AIM). This selection covers a wide range of journals from many clinical specialties. Our results may therefore be representative for non-inferiority trials published in other journals. However, external validity may be limited to higher quality journals. If non-AIM journals are of lower quality, the characteristics of non-inferiority trials published in those journals might be different. Furthermore, our search would have missed trials that do not mention the non-inferiority design in the abstract, the title or as a key word. The characteristics of such trials might also differ. We examined two aspects of non-inferiority trials: first, we combined the results of a large number of non-inferiority trials in a meta-analysis; secondly, we examined the non-inferiority margins chosen by the investigators. Importantly, the combined estimate from the meta-analysis will not be influenced by the choice of the non-inferiority margin. The combined estimate will be influenced by the efficaciousness of standard treatment. If the standard treatment is not effective, the experimental treatment is in fact tested against 'placebo' in a non-inferiority design. Although we did not assess whether the chosen standard treatment represented the best-available comparator, we did assess how standard treatments performed in view of what trialists had expected. On average, the standard treatments performed slightly better than was estimated at the design stage of the trials. Our study did not address other important issues pertaining to non-inferiority trials. 
For example, we did not assess whether a non-inferiority trial was the appropriate design to use (or whether a superiority design would have been more appropriate) or whether the choice of the non-inferiority margin that was used for the power calculation and statistical testing  made clinical sense. The non-inferiority margin is often criticized as being arbitrary, unacceptably wide or even fraudulent. 8,10 The selection of the non-inferiority margin should be based on a combination of statistical reasoning and clinical judgement. 9,24 Others have reviewed the rationale for the size of the margins in non-inferiority trials. 7,8 They found that the majority of trials did not justify the choice of the margin and that <20% reported a clinical consideration. An in-depth analysis of each trial with subject-matter knowledge on each topic would have been required to judge whether the choice of the margin was adequate. This was beyond the aim of the present analysis.
Does our meta-analysis rebut some of the criticism aimed at non-inferiority trials? We found that the combined RR for all studies was close to 1. This contradicts the hypothesis that in non-inferiority trials the experimental treatment is generally less effective than the standard treatment. We believe that this is an important, reassuring finding, considering the criticism that has been levelled at non-inferiority trials. [1][2][3][4][5][6][7][8][9][10][11][12] Several issues should nevertheless be considered when interpreting this result. First, current standards for drug approval stipulate that a new treatment should be better than placebo and (at least) non-inferior to the established options. This means that demonstrating non-inferiority can legally suffice for the licensing of a new drug. The underlying assumption is often that a 'non-inferior' treatment has added value regarding other properties, such as ease of use, lower costs or fewer adverse effects, which might offset a small loss in efficacy. Sometimes such superior properties, such as costs, are self-evident and do not have to be demonstrated in a trial. Claiming that an agent has fewer adverse effects should however be based on evidence. A separate analysis of the adverse event data, analyses of combined endpoints or a meta-analysis of several trials might be appropriate and informative to demonstrate superiority in this respect. Of the 175 comparisons in this meta-analysis, we considered the experimental treatment to be non-inferior in 130 (74%) and superior in 15 (9%). Although in 6 of these 145 comparisons the authors deemed the experimental treatment to be clinically inferior based on a secondary endpoint, the majority of published non-inferiority trials can be used to support the registration of a new treatment. The added value and safety of these treatments may not always be self-evident and may not always be demonstrated in the trial. The follow-up time and the sample size of trials are limited, making it improbable that rarer side effects or long-term side effects are detected.
Secondly, for superiority trials, it has repeatedly been described how the outcome may be skewed in favour of the experimental treatment by making convenient choices when designing the study or reporting or publishing the result. This may involve the choice of (the dosage of) the comparator drug, the choice of patients, endpoints or of the type of analysis. 19,27 It may also involve selective reporting of data or changing the pre-specified endpoint after a study is completed. 16 It is plausible that such mechanisms affect the results of non-inferiority trials. In other words, biased choices in study design and bias due to selective reporting of outcomes may make it more likely that an experimental treatment is considered non-inferior after completion of the trial. We did not have access to the study protocols of the included articles and relied on what was reported as the primary endpoint. Also, we restricted our search to studies that have been published. Unpublished trials are more likely to favour standard treatment. 17 Therefore, if publication bias were an issue, our results might be skewed in favour of experimental treatments. All these potential sources of bias would remain unnoticed in our meta-analysis. Although the funnel plot showed a symmetrical distribution of results around RR 1, this does not rule out biases. This leaves the possibility that our finding of an overall RR close to unity is skewed in favour of experimental treatment. The finding that studies sponsored by industry were more likely to have results favouring sponsored treatments is in line with other reports. 15,25,26 Systematic bias has been suggested as a possible explanation. Our finding could also be due to the play of chance.
Thirdly, the statistical verdict of non-inferiority permits licensing of a drug even if the trial result shows that it is somewhat less effective than standard. Therefore, some treatments that are approved based on non-inferiority testing may be less effective compared with the standard therapy with respect to the primary endpoint. A cascade of non-inferiority trials is possible, in which each subsequent experimental treatment is slightly less effective than the previously established 'standard'. After several generations of non-inferiority trials, ineffective interventions could be licensed, leading to deteriorating patient care. 4,11 This outcome has been called 'bio-creep'. 28 Our results are relevant in this context. Our study showed that of the 130 comparisons judged to be non-inferior, the point estimate favoured the standard treatment in 42% of trials. Bio-creep could occur if two or three trials in succession belong to this 42% category and if each subsequent trial adopts the previously demonstrated non-inferior treatment as the new active control treatment. Importantly, our study provided no empirical evidence for or against the existence of bio-creep.
In conclusion, the number of non-inferiority trials published in clinical journals has greatly increased. We found that the experimental treatments that gain a verdict of non-inferiority in trials published in core clinical journals are not systematically less effective than the standard treatments. Biases in design, reporting and publication cannot be ruled out and may have skewed the study results in favour of experimental treatments. Continued vigilance is required to ensure that non-inferiority trials are used appropriately.

Supplementary data
Supplementary data are available at IJE online.

Acknowledgement
We acknowledge the contribution of Theo Stijnen, PhD, Professor of Medical Statistics, Department of Medical Statistics and Bioinformatics, Leiden University Medical Centre, for critical discussion of the analyses, and of J.W. Schoones, MA, Walaeus Library, Leiden University Medical Centre, for assistance with the search strategy.
Conflict of interest: None declared.

KEY MESSAGES
There is a concern that non-inferiority trials might be deliberately designed to conceal that a new treatment is less effective than a standard treatment. There is little empirical evidence at present to support this notion, however.
The combined RR from 170 randomized trials using a non-inferiority design and published in core clinical journals in recent years was close to 1, favouring neither the experimental nor the standard treatment.
In the majority of trials, the new treatments were considered to be non-inferior. For these trials the combined RR was also close to 1.
The experimental treatments that gain a verdict of non-inferiority do not, therefore, appear to be systematically less effective than the standard treatments.
The evidence from published non-inferiority trials might still be distorted by publication bias, or by a biased choice of standard treatments. Further studies are required to clarify the risk of bias in non-inferiority trials.