N. M. Gibbs, S. V. Gibbs, Misuse of ‘trend’ to describe ‘almost significant’ differences in anaesthesia research, BJA: British Journal of Anaesthesia, Volume 115, Issue 3, September 2015, Pages 337–339, https://doi.org/10.1093/bja/aev149
There are many definitions of ‘trend’, but none apply to differences that have already been found to be non-significant in a statistical test. Yet, there appear to be many examples in the anaesthesia literature of the use of trend to describe differences that have been found by the authors to be ‘almost’ but not quite statistically significant (e.g. P=0.06). The implication appears to be that there is a subset of non-significant P values that suggest, support or represent a trend, by being ‘almost significant’. In this editorial we explain that describing non-significant differences as a trend is an error, and argue that it is neither a trivial nor merely semantic error. We also report an audit that suggests that this form of error is not uncommon in the anaesthesia literature and may be increasing in frequency.
The noun trend is defined as a ‘general direction in which something is developing or changing’ or a ‘fashion’ by the Oxford Dictionary,1 and as ‘a general direction of change’, ‘a way of behaving or proceeding’, ‘something that is developing and becoming more common’, ‘a tendency’ or ‘something that is currently popular or fashionable’ by the Merriam-Webster dictionary.2 To our knowledge, however, no dictionary or other authoritative source defines trend as ‘a difference that is almost, but not quite, statistically significant’. Most commonly, ‘trend’ is used as a general term in both scientific and non-scientific literature to describe apparent changes, as per the dictionary definitions. On the other hand, formal statistical tests are available when required to estimate the probability that observed changes in an apparent trend (e.g. in a time series) represent true differences rather than chance findings.3–7 These include the χ2 test for linear trend, the Cochran-Armitage test, and the Mann-Kendall trend test, to name only a few.3–7
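For readers unfamiliar with such formal tests, a minimal Mann-Kendall trend test can be sketched in a few lines of Python. The yearly counts below are hypothetical, purely for illustration, and this simple version omits the tie correction that a library implementation would include:

```python
from itertools import combinations
from math import sqrt, erfc

def mann_kendall(series):
    """Mann-Kendall test for monotonic trend (no tie correction)."""
    n = len(series)
    # S counts concordant minus discordant pairs (j > i)
    s = sum((b > a) - (b < a) for a, b in combinations(series, 2))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s == 0:
        return s, 1.0
    # continuity-corrected normal approximation
    z = (s - 1) / sqrt(var_s) if s > 0 else (s + 1) / sqrt(var_s)
    p = erfc(abs(z) / sqrt(2))  # two-sided P value
    return s, p

# hypothetical yearly counts with a clear upward trend
s, p = mann_kendall([3, 5, 4, 7, 8, 9, 12, 11, 14, 15])
print(s, p)
```

A small P value here indicates that the upward drift in the series is unlikely to be a chance finding, which is precisely the question a test for trend is designed to answer.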
Applying the term trend to almost significant differences demonstrates a misunderstanding of the meaning of P values. A P value describes the probability of obtaining the observed result or one more extreme given that the null hypothesis is true.8,9 If the probability is less than a pre-specified value (α, acceptable type I error) set by the authors, the null hypothesis is rejected. Typically this value is 0.05, but could be 0.01 or 0.1 or another value determined by the authors. The outcome of an inferential test is either rejection of the null hypothesis, or failure to reject the null hypothesis. There is no other outcome. In particular, there is no ‘almost rejected’ category when P values approach but are slightly greater than the pre-set α. To imply that there is an ‘almost rejected’ category is an obvious statistical error.
The overwhelming majority of P values >0.05 (or other pre-set α) in the anaesthesia literature (and indeed the scientific literature) are reported as non-significant, without any mention of trend, no matter how close they are to being significant. How then do some ‘almost significant’ P values suggest or support a trend, but not others? What range of P values is considered ‘almost significant’? To be consistent, either all P values within this hypothetical range suggest or support a trend, or none do. Describing some ‘almost significant’ P values as a trend but not others introduces a large element of subjectivity.
Describing a P value close to but not quite statistically significant (e.g. 0.06) as supporting a trend toward statistical significance has the same logic as describing a P value that is only just statistically significant (e.g. 0.04) as supporting a trend toward non-significance.10 Yet P values that are only just statistically significant are rarely if ever described as supporting a trend toward non-significance. Also, using trend to suggest ‘a general direction of change’ or ‘tendency’ once a finding has already been found to be non-significant (i.e. consistent with a chance finding) is a form of begging the question. The purpose of inferential testing (i.e. obtaining a P value) is to assess the likelihood that observed differences suggest a ‘general direction of change’ or ‘tendency’ as opposed to chance findings.
To estimate how frequently this error occurs in the anaesthesia literature, we conducted an audit of three anaesthesia journals. Using word recognition software we identified all uses of the term trend in all articles (including editorials, excluding case reports) published in the British Journal of Anaesthesia, Anesthesia and Analgesia, and Anaesthesia and Intensive Care in 1990, 2000, and 2010. There was a total of 2143 articles, and trend was used at least once in 258 of them. We scrutinized each use of trend to ascertain whether it described a difference that had been found to be non-significant on the basis of the authors' own a priori specified α (e.g. describing a non-significant difference as a ‘trend toward statistical significance’, a ‘non-significant trend’, or as ‘showing a trend’ despite the non-significance). We did not discriminate between primary and secondary outcomes. We found at least one example of this incorrect use in 28 of the 833 articles published by the three journals in 2010 (3.4%), an average of about one example of misuse per issue of each journal in 2010. For 2000, we found 14 examples among 817 articles (1.7%), and for 1990, 8 examples among 493 articles (1.6%). The 10 examples from the British Journal of Anaesthesia are shown in Table 1. [Details of the incidence across the three journals and a list of all 50 examples are available in Supplementary Appendices.]
Table 1 Examples from the British Journal of Anaesthesia of 'trend' used to describe non-significant differences (year; volume: pages)

- Although not statistically significant, there was a trend which suggested that bupivacaine with adrenaline performed best in almost every variable tested. 1990; 65: 648–53
- The trend toward reduced motor block together with a significantly increased need for supplementary analgesics … confirm findings from a recent study … [There was no significant difference in relation to motor block]. 2000; 84: 826–7
- Results showed a trend for a lower rate of bacterial infection with the use of tranexamic acid when compared with placebo (P=0.12). 2010; 104: 23–30
- A trend was also seen towards a reduction in myocardial damage after operation. However, this trend did not reach statistical significance. 2010; 104: 305–12
- Patients with the combination of both active drugs showed a trend to request opioids later than those with a single analgesic (P>0.05). 2010; 104: 761–7
- However, none of the LA treatments significantly influenced epidermal or inflammatory cell wound MIF levels, although there was a trend towards increased overall wound MIF levels in animals treated with bupivacaine 0.5%. 2010; 104: 768–73
- There was a trend to less pain at rest in the F group. [P=0.09]. 2010; 105: 185–95
- There was a non-significant trend for non-achievers to have higher sedation scores at the time of study entry. 2010; 105: 326–33
- The study was not powered for subgroup analysis, but there was a trend towards reduced hospital mortality in the cell saver group.48 [Reference 48 reports that there was no statistically significant reduction in overall hospital mortality in the cell saver group, P=0.07]. 2010; 105: 401–16
- The worst postoperative chronic pain score (VAS/NRS) was reported in one trial24 showing a trend for a better outcome 12 months after surgery (P=0.14). 2010; 105: 842–52
These results confirm that there is a subset of articles in the anaesthesia literature in which trend is being misused to describe ‘almost significant’ differences. Moreover, we found an increase over the three index years consistent with a true trend (Cochran-Armitage test for trend, P=0.021, Fig. 1). This observation serves to highlight the ambiguity that is introduced if trend is used for statistical findings that have not been subject to a specific test for trend, and moreover, which have been found to be non-significant in another test. Furthermore, this incorrect use of trend represented only a small proportion of the total uses of trend in our audit. The majority of uses were for correct purposes (e.g. in relation to dictionary definitions or specific statistical tests for trend). This majority correct use is undermined by the ambiguity introduced by the small subset of misuse.
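For readers who wish to verify the calculation, a Cochran-Armitage test on the audit counts above can be sketched in Python. We assume equally spaced scores for the three index years; this asymptotic two-sided P comes out close to, though not exactly equal to, the reported 0.021, which may reflect a different scoring or an exact method:

```python
from math import sqrt, erfc

def cochran_armitage(events, totals, scores):
    """Two-sided asymptotic Cochran-Armitage test for trend in proportions."""
    n = sum(totals)
    p_bar = sum(events) / n  # pooled proportion
    sx = sum(t * x for t, x in zip(totals, scores))
    # statistic: score-weighted events, observed minus expected
    t_stat = sum(e * x for e, x in zip(events, scores)) - p_bar * sx
    var_t = p_bar * (1 - p_bar) * (
        sum(t * x * x for t, x in zip(totals, scores)) - sx * sx / n
    )
    z = t_stat / sqrt(var_t)
    return z, erfc(abs(z) / sqrt(2))  # two-sided normal P value

# audit counts: articles misusing 'trend' out of articles published
# in 1990, 2000, and 2010, with equally spaced year scores
z, p = cochran_armitage([8, 14, 28], [493, 817, 833], [0, 1, 2])
print(round(z, 2), round(p, 3))
```

The positive z confirms the direction of the increase across the index years, and P < 0.05 is consistent with the result reported in the editorial.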
Although we audited only three anaesthesia journals, we have no reason to suspect that our findings are not typical of the broader range of anaesthesia journals. The three journals are published in different regions, indicating that this is not a local issue. Moreover, two of the journals have a high impact factor and wide readership, and would be considered to be among the most highly regarded in the anaesthesia literature.
It is likely that the use of trend to describe almost significant differences is mostly an innocent error, and that the intention is to imply only that the observed differences, although non-significant, may be worthy of further investigation in subsequent more highly powered studies. This may be an entirely appropriate interpretation, as a negative finding is never proof of ‘no difference’. On the other hand, misuse of trend to describe almost significant differences could be misinterpreted by less informed readers as suggesting a real trend, which would be misleading.
In summary, the use of trend to describe ‘almost significant’ differences is an error both in word usage and statistical inference. It introduces both inconsistency and ambiguity. It promotes a misunderstanding of P values and undermines the many correct uses of the term. More importantly, it may be misleading if readers assume that a real trend has been suggested, supported or demonstrated. Our audit findings indicate that this error is not uncommon in anaesthesia research and may be increasing. We recommend that trend should not be used to describe any subset of non-significant differences and should be reserved only for the currently accepted dictionary or scientific definitions of the term, or in relation to specific statistical tests for trend.
Supplementary material
Supplementary material is available at British Journal of Anaesthesia online.
Declaration of interests
N.M.G. is currently the Chief Editor of Anaesthesia and Intensive Care. S.V.G. has no interests to declare.
Acknowledgement
The authors wish to thank Dr William Weightman for performing the Cochran-Armitage test for trend.
Comments
Gibbs and Gibbs [1] argue that the word "trend" is misused by authors considering their statistical results.
One reason why statistics is poorly understood is that many words in statistics books are not used in the way they are used colloquially, or even as they may be defined in dictionaries.[2] Semantics aside, it's clear that some disapprove of "moving goalposts" and claims of "almost significance". However, many others might argue that P is a continuous variable, and the difference between 0.049 and 0.051 must perforce be subtle.
In statistical judgement, little is utterly firm, and context is all-important. Thus, the US Supreme Court didn't accept a cutoff value for significance in a recent famous verdict (Matrixx Initiatives, Inc., et al. v. Siracusano et al., on certiorari to the United States Court of Appeals for the Ninth Circuit, No. 09-1156, decided March 22, 2011). It sensibly disagreed with an argument that "reports of adverse events ... cannot be material (if there are not) a sufficient number of such reports to establish a statistically significant risk that the product is in fact causing the events". The court's view was that "A lack of statistically significant data does not mean that medical experts have no reliable basis for inferring a causal link between a drug and adverse events". The court was not willing to accept a single "bright line" on the statistical spectrum that separated true from false.
Even more important, a naked P value on its own is not the final arbiter of significance that many believe it to be. If a test cannot persuade us to reject the null hypothesis, many would wish to know the power of the test, concerned that the negative result may be false, and that "not different" is not the same as "the same". However, non-statisticians often fail to realize that the power of a test is equally important when considering statistically significant results, when the null hypothesis has been considered untenable because P was <0.05, or whatever other cutoff value had been arbitrarily set to indicate "unlikely". Unfortunately P values are fickle, and the statistical power of a test dramatically affects our capacity to interpret the P value: once more, context is very important.
Random variation means that many experiments, often conducted with small samples, are intrinsically unreliable. To demonstrate this, we conducted a simple simulation.[3] We compared two normally distributed populations that WERE different: the SD of each population was 1, and the means of the populations differed by 0.5 (i.e. there was a true, but small, difference). Many would believe that the results of t tests conducted using repeated samples taken from these populations would yield P values less than 0.05 on about 95% of occasions, arguing that, after all, in this situation the null hypothesis has actually been set up to be false. Not so: with small samples (10 per group), repeated tests yield P < 0.05 only 18% of the time. To obtain a test with a power around the "usual" value of 80%, samples of 64 per group are needed. In these conditions, 80% of the tests are "significant" using the usual definition. However, the variation in P values obtained from these repeated tests is extreme: they range from <0.001 to 0.4. Remember, P is a continuous variable: in this context, it expresses the "likelihood" that the samples come from a single population. Each time the test is replicated, we obtain a very different P value, because random samples vary, and underpowered tests are unreliable. Only with samples of 100 per group do we consistently obtain a small P value.
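The simulation described above is straightforward to reproduce. A sketch using numpy and scipy follows; the seed and replication count are our choices, and the population parameters match those stated (SD 1, true mean difference 0.5):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def power_at(n, reps=5000, delta=0.5):
    """Fraction of two-sample t tests with P < 0.05 when the
    true difference in means is delta (SD = 1 in both groups)."""
    hits = 0
    for _ in range(reps):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(delta, 1.0, n)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    return hits / reps

p10 = power_at(10)   # small samples: power far below 95%
p64 = power_at(64)   # the conventional 80% power
print(p10, p64)
```

The estimate for 10 per group lands near 0.18, and for 64 per group near 0.80, in line with the figures quoted in the letter.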
Given the usually small samples and low-powered studies that appear in the anaesthesia literature, we have to contend with the frequent possibility that if the experiment were repeated, the P value we obtain could be substantially different from the P value we got before. If so, then the 'almost rejected' category is more likely to be a valid inference, and not "an obvious statistical error". The samples drawn in each study are not sacrosanct, merely random: and small samples are less likely to precisely assess the population from which they have been drawn.
Worse, unless the samples are very large (so that the sampled populations can be defined with adequate precision), the estimated effect size is exaggerated. This is the "winner's curse": a small random sample can by chance be extreme, thus tripping the P<0.05 line and at the same time suggesting a substantial difference. Although this is the result that is accepted for publication, we should be aware that a result isn't necessarily reliable when P<0.05, unless the power is adequate. How many of the 50 "trend" papers identified also reported the power of the study? We know that many "positive results" cannot be replicated, even when these appear to be scientifically sound.[4] These positive results came from small, perhaps underpowered, studies. A surprising positive result should be as suspect as the "borderline" positive result, and we should stop taking P<0.05 as a signal of automatic veracity.
References
1. Gibbs NM, Gibbs SV. Misuse of 'trend' to describe 'almost significant' differences in anaesthesia research. Br J Anaesth 2015; 115: 337-9
2. Drummond GB, Tom BDM. Statistics, probability, significance, likelihood: words mean what we define them to mean. J Physiol 2011; 589: 3901-4
3. Halsey LG, Curran-Everett D, Vowler S, Drummond GB. The fickle P value generates irreproducible results. Nature Methods 2015; 12: 179-185
4. Ioannidis JP. Contradicted and initially stronger effects in highly cited clinical research. JAMA 2005; 294: 218-28
Conflict of Interest:
None declared
We would like to thank Doleman et al (1) for their interest in our editorial (2), but disagree with many of their comments. We do not agree that the misuse of 'trend' to describe almost significant differences is a semantic issue. We feel that it is an error. Also, we cannot see how drawing attention to this relatively rare error clouds the 'greater issue in the reporting of statistics in the scientific literature'. Doleman et al suggest that P values are outdated and should be avoided. However, their arguments are based on the misuse and misinterpretation of P values, not on their correct use and interpretation. Moreover, many of their arguments apply equally to the use of confidence intervals (CI). For example, both CI width and P value are heavily dependent on sample size (3). Similarly, the choice of a significance level of P < 0.05 is no more arbitrary than the choice of a 95% CI as opposed to a 90% CI or another value. It is also no more absurd to have P = 0.04 as significant and P = 0.06 as non-significant than it is to have a CI lower limit = +0.1 as significant and a lower limit = -0.1 as non-significant (or vice versa). More importantly, both approaches, not only CIs, require reference to a pre-determined clinically significant effect size for a correct interpretation (3,4,5,6). Another fundamental issue, which Doleman et al do not mention, is the requirement for adequate a priori power whether P values or CIs are used (3,4,5,6).
In the examples they provide, the correct interpretation of the P values would be the same as the correct interpretation of the CIs. In other words, it is not the use of P values that results in 'widely different and erroneous conclusions'; it is their misuse. Nevertheless, we are all too aware of the widespread misinterpretation of P values in the anaesthesia literature, as has been reported by Gibbs and Weightman recently (6,7), and previously by many other authors (8). Where we differ from Doleman et al is that we would encourage the correct use of P values, rather than their abandonment. The correct use would include avoiding the term 'trend' to describe 'almost significant' differences.
We agree that CIs provide valuable information on the range of likely true effect sizes and the precision of estimates in the population being studied. However, we would caution that to be used and interpreted correctly, CIs are as dependent as P values on reference to the pre-specified minimum clinically important effect size and adequate a priori power (3,4,5,6). We suspect that there are many examples of CI use in the anaesthesia literature where these conditions are not met.
NM Gibbs
SV Gibbs
References
1. Doleman B, Lund JN, Williams P. Re: Misuse of 'trend' to describe 'almost significant' differences in anaesthesia research. Br J Anaesth 2015:eLetter
2. Gibbs NM, Gibbs SV. Misuse of 'trend' to describe 'almost significant' differences in anaesthesia research. Br J Anaesth 2015; 115: 337-9.
3. Daly LE. Confidence intervals and sample sizes. In: Altman DG, Machin D, Bryant TN, Gardner MJ eds. Statistics with confidence. 2nd ed. Bristol: BMJ Books, 2000:139-52.
4. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 1994; 121:200-6.
5. Katz MH. Study design and statistical analysis. A practical guide for clinicians. Cambridge: Cambridge University Press, 2006:127-40.
6. Gibbs NM, Weightman WM. Beyond effect size: consideration of the minimum effect size of interest in anesthesia trials. Anesth Analg 2012; 114: 471-5.
7. Gibbs NM, Weightman WM. An audit of the statistical validity of conclusions of clinical superiority in anaesthesia journals. Anaesth Intensive Care 2014; 42: 599-607.
8. Goodman NW, Powell CG. Could do better: statistics in anaesthesia research. Br J Anaesth 1998; 80: 712-714.
Conflict of Interest:
None declared
We would like to thank Dr Smith (1) for his interest and comments on our editorial (2). We agree with his remarks about medical conservatism and appreciate the desire of many authors to highlight results that are almost, but not quite, statistically significant. However, this can be achieved without using the potentially misleading term 'trend'. Authors can legitimately describe such findings as 'encouraging', 'of interest', or 'worthy of further investigation', so long as the lack of statistical significance is clear, and the same interpretation is given to all P values in their study that fall within this 'almost statistically significant' range, and not only those that support their test hypothesis. Alternatively, authors are at liberty to accept a larger alpha error from the outset (e.g. using P < 0.1 as statistically significant rather than < 0.05). We agree that P values present a continuum and that actual P values contain valuable information, whether or not they are 'significant'. For this reason, we would encourage reporting of actual P values under all circumstances (i.e. not only as significant vs non-significant). We consider that when interpreted correctly, P values already 'speak for themselves', although the effect size, pre-specified minimum clinically important effect size, and beta error must also be considered (3). We feel that it is not necessary 'to drop the concept of significance'. It is necessary only to use it correctly and to know its limitations.
NM Gibbs
SV Gibbs
References
1. Smith I. Misuse of 'trend' to describe 'almost significant' differences in anaesthesia research. Br J Anaesth 2015:eLetter
2. Gibbs NM, Gibbs SV. Misuse of 'trend' to describe 'almost significant' differences in anaesthesia research. Br J Anaesth 2015; 115: 337-9.
3. Gibbs NM, Weightman WM. Beyond effect size: consideration of the minimum effect size of interest in anesthesia trials. Anesth Analg 2012; 114: 471-5.
Conflict of Interest:
None declared
Dear Sir,
I thank Drs Gibbs and Gibbs for highlighting the linguistic and statistical inaccuracies in the use of "trends" in the anaesthesia research literature (BJA 2015; 115(3): 337-9). As a teacher and reviewer, I have argued against this practice, while as an author I have almost certainly fallen into the same trap myself! While I cannot disagree with a word of the editorial, I can perhaps offer a little more insight into why this trend might be developing. The authors correctly describe how a predefined p value is used to determine whether the null hypothesis is rejected or accepted, and hence whether results are statistically significant or not. Because of medical conservatism, we are reluctant to replace a tried and tested therapy with something relatively new, and so the threshold for rejecting the null hypothesis is deliberately set high, typically at a level where the observed result would have occurred by chance less than five times in a hundred. Conversely this means that, in the area in which "trends" tend to be mentioned, the observed result is likely to represent a genuine difference between treatments approximately ninety to just under ninety-five percent of the time. In other circumstances, these would still be very good odds indeed! It is not surprising that authors want to highlight these results, even if trend is not the correct word to use. Statistical tests provide a precise estimate of the likelihood that the results are due to random chance, or to a genuine effect, along a continuum extending from near certainty to almost infinite improbability. Yet we choose to discard almost all of this information and distil it into an arbitrary, yes/no cut-off between significant and non-significant. Perhaps the time has come to drop the concept of significance and simply let the p-values talk for themselves.
Conflict of Interest:
None declared
Dear Editor,
We read with interest the article by Gibbs and Gibbs on the misuse of the word trend to describe 'almost' significant p values [1]. However, we believe this discussion of semantics is clouding a somewhat greater issue in the reporting of statistics in the scientific literature. As Gibbs and Gibbs noted, p values are inferential statistics that give the probability of obtaining a value the same as or more extreme than that found if the null hypothesis were true. However, we argue that p values are outdated and that, for clinical studies in particular, their use should be avoided. This is not a new concept; however, we hope this letter will serve to remind the readership of the flaws in null hypothesis statistical testing. We will first highlight the problems associated with the [mis]use of p values, and then discuss the advantages of estimation-based methods, illustrating this with a theoretical example.
Firstly, the use of p values can often detract from a more important issue when conducting a clinical study: the assessment of clinical significance. The misinterpretation of p values means that readers may mistake 'statistical significance' for 'clinical significance'. As the calculation of p values is heavily dependent on sample size, large studies may yield very small p values for effects that are not clinically significant. Indeed, such small p values may be evidence against the use of a particular treatment, as they can make us more confident that a treatment will not have a clinically significant effect (see later example). Secondly, the choice of a significance level of p<0.05 is arbitrary. It may have originated with the statistician Ronald Fisher, who suggested it as a sensible cut-off, although he never advocated it as an absolute rule [2]. It is absurd to suggest that a study that reports a p value of 0.04 is 'positive' and another that reports a p value of 0.06 is 'negative' [3]. Such 'negative' p values promote fear in researchers and students while also rendering a study less likely to be published [4].
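The dependence of p on sample size is easy to demonstrate analytically: for two equal groups with SD 1 and a fixed standardized difference d, the two-sample z statistic is d·√(n/2), so p shrinks as n grows even though the effect remains clinically trivial. A sketch with hypothetical numbers:

```python
from math import sqrt, erfc

d = 0.1  # fixed standardized difference, far below any plausible MCID
ps = []
for n in (50, 500, 5000, 50000):  # participants per group
    z = d * sqrt(n / 2)           # two-sample z statistic with SD = 1
    p = erfc(abs(z) / sqrt(2))    # two-sided P value
    ps.append(p)
    print(n, p)
```

The same clinically unimportant difference moves from clearly 'non-significant' to vanishingly small p as the groups grow, which is exactly the point the letter makes about mistaking statistical for clinical significance.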
Since the late 1970s, confidence intervals (CI) have been proposed as an alternative to p values [5]. Confidence intervals present a range of values within which the population mean is likely to lie. More specifically, a 95% confidence interval means that should the experiment be repeated 100 times, about 95 of the resulting intervals would contain the true population mean [6]. Such an approach is advocated by the International Committee of Medical Journal Editors, who state: 'When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid relying solely on statistical hypothesis testing, such as P values, which fail to convey important information about effect size and precision of estimates'.
Confidence intervals contain a wealth of information, including all the information of a traditional p value. If the confidence interval includes the null value, then the p value is >0.05 (if 95% CIs are used) [7]. Moreover, confidence intervals can be better used to assess the likelihood that an intervention has a clinically significant effect. Consider an example: we wish to know the efficacy of two different analgesic agents (x and y) for treating postoperative pain. We undertake two randomised controlled trials (RCT) with the two agents and pre-determine a clinically significant reduction in pain as 15mm (on a 100mm VAS) [8]. The first RCT with agent x enrols a large number of participants and demonstrates a mean difference of -5mm (95% CI -3mm to -7mm; p<0.001). The second study with agent y recruits far fewer participants and demonstrates a mean difference of -12mm (95% CI 0.1mm to -24.1mm; p=0.06). Although the p value of the first study is very low, the results indicate that we can be confident this agent does not produce a clinically significant effect and should therefore not be used. The second study, although not statistically significant, does not exclude a clinically significant effect, and more studies would need to be conducted to increase power and narrow the confidence interval. If we had relied solely on statistical significance, widely different and erroneous conclusions would have been made.
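The interpretation logic of this example can be made explicit in a short sketch; the classifying function and its labels are ours, for illustration only, with a negative difference denoting a reduction in pain:

```python
def interpret(lo, hi, mcid):
    """Classify a 95% CI for a pain reduction (negative = benefit)
    against a pre-specified minimum clinically important difference."""
    significant = hi < 0 or lo > 0   # CI excludes zero
    # True if the whole interval falls short of a clinically important benefit
    excludes_mcid = lo > -mcid
    if significant and excludes_mcid:
        return "statistically significant but clinically unimportant"
    if not significant and not excludes_mcid:
        return "inconclusive: a clinically important effect remains possible"
    if significant and not excludes_mcid:
        return "significant; may be clinically important"
    return "neither significant nor clinically important"

# agent x: -5 mm (95% CI -7 to -3); agent y: -12 mm (95% CI -24.1 to 0.1)
print(interpret(-7.0, -3.0, 15.0))
print(interpret(-24.1, 0.1, 15.0))
```

Agent x's tight interval rules out a 15 mm benefit despite p<0.001, while agent y's wide interval leaves one possible despite p=0.06, reproducing the conclusions drawn in the text.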
We accept that this argument is not new, and we applaud the British Journal of Anaesthesia for promoting confidence intervals in its instructions for authors. However, we feel that this letter should serve as a reminder to the readership of both the flaws of p values and the advantages of confidence intervals. We hope this will also encourage authors to report confidence intervals wherever possible.
References
[1] Gibbs NM, Gibbs SV. Misuse of 'trend' to describe 'almost significant' differences in anaesthesia research. British Journal of Anaesthesia 2015: aev149.
[2] Sterne JA, Smith GD. Sifting the evidence--what's wrong with significance tests? Physical Therapy 2001; 81: 1464-1469.
[3] Rosnow RL, Rosenthal R. Statistical procedures and the justification of knowledge in psychological science. American Psychologist 1989; 44: 1276.
[4] Stern JM, Simes RJ. Publication bias: evidence of delayed publication in a cohort study of clinical research projects. British Medical Journal 1997; 315: 640-645.
[5] Rothman K. A show of confidence. New England Journal of Medicine 1978; 299: 1362-3.
[6] Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal 1986; 292: 746-750.
[7] Cohen J. The earth is round (p < .05). American Psychologist 1994; 49: 997-1003, p. 997.
[8] Gallagher EJ, Liebman M, Bijur PE. Prospective validation of clinically important changes in pain severity measured on a visual analog scale. Annals of Emergency Medicine 2001; 38: 633-638.
Conflict of Interest:
None declared