The earth is round (P < 0.05) (Cohen, 1994)

The increased involvement of statisticians in the peer review process has led to greater communication between statisticians and journal editors (Marks et al., 1988). Despite this, the quality of statistical reporting in many scientific journals remains below par. Statistical errors are commonplace and the problem appears to be long-standing (see, e.g. Schor and Karten, 1966; Feinstein, 1974; Gore et al., 1977; White, 1979; Glantz, 1980; Altman, 1982; Bland and Altman, 1987; Pocock et al., 1987). The main problem for the researcher would appear to be a lack of understanding of even the most basic statistics (Mathews and McPherson, 1987). Part of the blame has been apportioned to poor statistical teaching (Rigby, 1998); others blame a lack of interest on the part of the researcher (Mathews and McPherson, 1987). Whatever the reasons, inappropriate use of statistics can lead to the rejection of manuscripts. According to Sorenson et al. (1998), two-thirds of peer-reviewed manuscripts submitted to Health Education Research are rejected, usually after one review. A common reason for rejection is flaws in the data analysis.

What potential authors need to know is how to avoid rejection at this first stage. As a statistical referee for many scientific journals, one of my pet hates is authors who present their results with a plethora of P-values (e.g. P < 0.05, P < 0.01, P < 0.001, P < 0.0001). What I prefer to see, in any scientific article that merits a statistical analysis, is a statement about effect size rather than an overemphasis on P-values.

But what is it about P-values and their presentation that I find so distracting? Two decades ago the clinical epidemiologist Alvan Feinstein (Feinstein, 1977) commented '...scientific reputations are made or lost on the basis of the magisterial phrase and number: statistical significance at P ≤ 0.05'. To understand Feinstein's concern it is necessary first to define a P-value. Altman (1991) defines a P-value as '...the probability of having observed our data when the null hypothesis is true'. This is a useful starting point in trying to understand more about P-values. I think of a P-value as an error probability (Table I). The information presented in Table I represents a cross-classification of the outcome of a statistical test with 'the truth'. This results in a 2×2 contingency table with four possible outcomes in terms of statistical errors. If the test statistic concludes that there is a difference between groups of observations when a difference does not exist, this results in a statistical error. This type of error is called a type I error. The type I error is also known as an α error or P-value or statistical significance. Type I errors are usually set at a threshold of 5%. However, in an increasingly evidence-based world the 5% threshold for statistical significance has no evidence base itself; it is an entirely arbitrary threshold. The great and the good in statistics have tried to find out when, where and why 'P < 0.05 as statistically significant' came into being. The consensus is that it just emerged and the mud has stuck, so to speak. This conclusion is reinforced by Feinstein (1975), who commented 'The role of this number [0.05] has become so widely accepted and worshipped that one might expect to find a record of time and place when the apotheosis occurred. No such record exists'. P = 0.05 means that it is accepted that a statistically significant difference will occur 5% of the time if the null hypothesis is true (Table I).
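
The idea that the P-value is an error probability can be made concrete with a short simulation (a sketch of my own, not from the original article): when the null hypothesis is true by construction, a test at the conventional 5% threshold will nonetheless declare 'significance' in roughly 5% of repetitions, and every one of those results is a type I error.

```python
import numpy as np
from scipy import stats

# Simulate many two-sample t-tests in which the null hypothesis is TRUE:
# both groups are drawn from the same normal distribution.
rng = np.random.default_rng(42)
n_tests = 10_000
false_positives = 0
for _ in range(n_tests):
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)  # no real difference exists
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:  # the conventional, arbitrary threshold
        false_positives += 1

# Because no difference exists, every 'significant' result above is a
# type I error; the long-run rate is close to the chosen alpha of 5%.
print(false_positives / n_tests)
```

The simulation illustrates why the threshold itself is arbitrary: replacing 0.05 with any other alpha simply changes the long-run false-positive rate to that value.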

Consider the statement P = 0.000001 and think of this as an error probability (Table I). This means that a statistically significant result will occur 1 in a million times if the null hypothesis is true. It is entirely possible that the resulting P-value could be less than 0.000001. However, it never actually becomes zero: there is always the possibility that an observed difference arose by chance. Apart from the sheer difficulty of interpreting such 'low' P-values, I have argued that such precision has limited scientific value (Rigby, 1998). For these reasons, when I report P-values I adopt a simple dichotomy: significant or not significant at a predetermined level (0.05, say). This simple dichotomy does not go down well with some. Gardner and Altman (1986) have argued that such a dichotomy '...encourages lazy thinking...'. This is of no concern to me. If I have set a threshold for significance at an albeit arbitrary 5% level, then I have no interest in how far below (or indeed above) the actual significance level may fall (or rise). For me, the dichotomy serves a purpose: it simplifies the reporting of results. Phrases such as 'very significant' or 'more significant', which are often implied by statements like P = 0.000001, are anathema to me. In my line of work, the 'degree of significance' does not come into play and I am happy to go with the dichotomy.
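
The dichotomy described above amounts to a one-line reporting rule. A hypothetical helper (my own illustration; the function name and interface are not from the article) makes the convention explicit:

```python
def report(p: float, alpha: float = 0.05) -> str:
    """Report a test result as a simple dichotomy at a predetermined level.

    The exact P-value is deliberately discarded: under this convention,
    P = 0.04 and P = 0.000001 receive exactly the same verdict, so no
    'degree of significance' can be read into the output.
    """
    return "significant" if p < alpha else "not significant"

print(report(0.000001))  # same verdict as report(0.04)
print(report(0.2))
```

The design choice is the point: by returning only two labels, the function makes statements like 'very significant' impossible to express.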

What about non-significant P-values? Perhaps surprisingly, they can be as difficult to interpret as significant P-values. Two explanations can account for non-significant results. First, there really could be no difference between the treatment groups (bottom left-hand cell of Table I). Second, a type II error could have been committed (bottom right-hand cell of Table I). To minimize type II errors, the power of the study should be increased. Power is also defined in Table I (top right-hand cell): it is the chance (or probability) of showing a difference between treatment groups if a difference actually exists. However, the research evidence suggests that many studies are underpowered. For example, in a much publicized study, Freiman et al. (1978) found that only 30% of trials published in the New England Journal of Medicine were sufficiently large to have a 90% chance (i.e. power) of detecting even large differences in effectiveness of the treatments being compared. To coin a phrase, 'Absence of evidence is not evidence of absence' (Altman and Bland, 1995).
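
The link between sample size and power can also be shown by simulation (again a sketch of my own, with illustrative numbers not drawn from the article): with a genuine moderate difference between groups, a small study detects it only a minority of the time, i.e. it commits type II errors at a high rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def power_sim(n_per_group, effect, n_sims=5000, alpha=0.05):
    """Estimate power: the fraction of simulated trials in which a real
    difference of size `effect` (in SD units) is declared significant."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect, 1.0, n_per_group)  # a real difference exists
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            hits += 1
    return hits / n_sims

print(power_sim(20, 0.5))   # roughly 0.3: badly underpowered
print(power_sim(64, 0.5))   # roughly 0.8: a conventionally acceptable power
```

With 20 subjects per group, a true half-standard-deviation difference is missed most of the time; a non-significant result from such a study says very little, which is exactly the 'absence of evidence' point.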

Both significant and non-significant P-values require explanation, which is not always as easy as it may seem. In line with others of my profession, therefore, I do not recommend reporting results using P-values alone. If not by P-values alone, then by what other method? I advocate presenting the results of statistical analyses using confidence intervals. A confidence interval is a measure of the precision of a sample statistic (e.g. mean, odds ratio, regression coefficient, correlation coefficient): the narrower the confidence interval, the more precise the sample statistic, and vice versa. Although confidence intervals give more information about the findings of a study, confidence intervals and P-values are related (Gardner and Altman, 1986; Pocock et al., 1987). In a statistical comparison of two means, for example, a treatment difference that is significant at the 5% level has a 95% confidence interval that does not include a zero difference.
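
The correspondence between the 5% test and the 95% interval can be verified directly. The sketch below uses hypothetical blood pressure data of my own invention; the pooled standard error matches the classic equal-variance two-sample t-test, so the interval excludes zero exactly when P < 0.05.

```python
import numpy as np
from scipy import stats

# Hypothetical systolic blood pressure readings for two treatment groups
a = np.array([128, 131, 119, 125, 133, 127, 122, 130, 126, 129], dtype=float)
b = np.array([121, 118, 124, 117, 120, 123, 116, 122, 119, 125], dtype=float)

diff = a.mean() - b.mean()  # the effect size: difference in means

# Pooled standard error, matching the classic two-sample t-test
na, nb = len(a), len(b)
sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
se = np.sqrt(sp2 * (1 / na + 1 / nb))

# 95% confidence interval for the difference in means
t_crit = stats.t.ppf(0.975, df=na + nb - 2)
ci = (diff - t_crit * se, diff + t_crit * se)

_, p = stats.ttest_ind(a, b)  # pooled (equal-variance) t-test by default
print(f"difference = {diff:.1f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f}), P = {p:.4f}")
```

Reporting 'difference = 6.5 mmHg, 95% CI 3.1 to 9.9' conveys both the size of the effect and its precision; the bare statement 'P < 0.05' conveys neither.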

The advocacy of confidence interval estimation for the presentation of results in scientific journals is not new (see, e.g. Yates, 1951; Savage, 1957; Rozeboom, 1960; Gardner and Altman, 1986; Simon, 1986; Bulpitt, 1987). One of the first to make such a recommendation was Frank Yates (Yates, 1951). However, Yates' ideas did not 'catch fire' until Gardner and Altman's seminal paper was published in the British Medical Journal (Gardner and Altman, 1986). The widespread readership of the British Medical Journal has been cited as one reason why it has been Gardner and Altman, rather than others, who have taken the credit for the better reporting of statistics today (Rigby, 1998). Whoever takes (or rather should take) the credit, statistical reporting in scientific journals is improving, albeit slowly. Although many journals still do not have the services of a statistical referee (statistics is, after all, an uncommon profession), journal editors today are becoming more aware of the issues (Finney and Harper, 1993). If journal editors do not encourage the use of confidence interval estimation, the 'P-value culture' identified by John Nelder (Nelder, 1986, 1999) will remain.

Can budding authors submitting to Health Education Research learn anything here about getting their work published? Articles that require statistical presentation should quote effect sizes with 95% confidence intervals; P-values should be used sparingly. With this in mind, potential authors should not fail at the first hurdle: getting past the statistical referee.

Table I.

Statistical errors in hypothesis testing (modified from Rigby, 1998)

                                             Truth
Outcome of test                              Null hypothesis                  Alternative hypothesis
                                             [there is no difference          [there is a difference
                                             (A = B)]                         (A ≠ B)]
Significant difference detected (A ≠ B)      type I error (α)                 power (1 − β)
No significant difference detected (A = B)   no error (1 − α)                 type II error (β)

Possible outcomes in terms of statistical errors:
No error if we conclude A = B when A = B.
No error if we conclude A ≠ B when A ≠ B.
Type I error if we conclude A ≠ B when A = B.
Type II error if we conclude A = B when A ≠ B.

References

Altman, D. G. (1982) Statistics in medical journals. Statistics in Medicine, 1, 59–71.
Altman, D. G. (1991) Practical Statistics for Medical Research. Chapman & Hall, London.
Altman, D. G. and Bland, J. M. (1995) Statistics notes: absence of evidence is not evidence of absence. British Medical Journal, 311, 485.
Bland, J. M. and Altman, D. G. (1986) Caveat Doctor: a grim tail of medical statistics textbooks. British Medical Journal, 279, 979.
Bulpitt, C. J. (1987) Confidence intervals. Lancet, i, 494–497.
Cohen, J. (1994) The earth is round (P < .05). American Psychologist, 49, 997–1003.
Feinstein, A. R. (1974) A survey of statistical procedures in general medical journals. Clinical Pharmacology and Therapeutics, 15, 97–107.
Feinstein, A. R. (1975) Biological dependency, 'hypothesis testing', unilateral probabilities, and other issues in scientific direction vs. statistical duplicity. Clinical Pharmacology and Therapeutics, 17, 499–513.
Feinstein, A. R. (1977) Clinical Biostatistics. Mosby, St Louis, MO.
Finney, D. J. and Harper, J. L. (1993) Editorial code for presentation of statistical analyses. Proceedings of the Royal Society of London Series B, 254, 287–288.
Freiman, J. A., Chalmers, T. C., Smith, H. and Keubler, R. R. (1978) The importance of beta, type II error and sample size in the design and interpretation of the randomized controlled trial. New England Journal of Medicine, 299, 690–694.
Gardner, M. J. and Altman, D. G. (1986) Confidence intervals rather than P-values: estimation rather than hypothesis testing. British Medical Journal, 283, 600–602.
Glantz, S. A. (1980) Biostatistics: how to detect, correct and prevent errors in the medical literature. Circulation, 1, 1–7.
Gore, S. M., Jones, I. G. and Rytter, E. C. (1977) Misuse of statistical methods: critical assessment of articles in the BMJ from January to March 1976. British Medical Journal, 1, 85–87.
Marks, R. G., Dawson-Saunders, E. K., Bailar, J. C., Dan, B. D. and Verran, J. A. (1988) Interactions between statisticians and biomedical journal editors. Statistics in Medicine, 7, 1003–1011.
Mathews, D. R. and McPherson, K. (1987) Doctors' ignorance of statistics. British Medical Journal, 294, 856–857.
Nelder, J. A. (1986) Statistics, science and technology. Journal of the Royal Statistical Society Series A, 149, 109–121.
Nelder, J. A. (1999) From statistics to statistical science (with comment). The Statistician, 48, 257–269.
Pocock, S. J., Hughes, M. D. and Lee, R. J. (1987) Statistical problems in the reporting of clinical trials. New England Journal of Medicine, 317, 426–432.
Rigby, A. S. (1998) Statistical methods in epidemiology. I. Statistical errors in hypothesis testing. Disability and Rehabilitation, 20, 121–126.
Rozeboom, W. W. (1960) The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416–428.
Savage, I. R. (1957) Nonparametric statistics. Journal of the American Statistical Association, 52, 331–344.
Schor, S. and Karten, I. (1966) Statistical evaluation of medical journal manuscripts. Journal of the American Medical Association, 195, 145–150.
Simon, R. (1986) Confidence intervals for reporting of results of clinical trials. Annals of Internal Medicine, 105, 429–435.
Sorenson, J. R., Steckler, A. and Bernhardt, J. (1998) Eighteen months and 100 manuscripts later (Editorial). Health Education Research, 13, i–ii.
White, S. J. (1979) Statistical errors in papers submitted to the British Journal of Psychiatry. British Journal of Psychiatry, 135, 336–342.
Yates, F. (1951) The influence of statistical methods for research workers on the development of the science of statistics. Journal of the American Statistical Association, 46, 19–34.