In this issue of the journal, Prof. Jon Wakefield provides a contribution to the complex topic of screening genetic associations.1 My comments here are intended to clarify some points and outline connections of his discussion to broader problems, describing how methods such as Wakefield's can be appropriate in epidemiology beyond genetic research. I will also comment on a few aspects of his presentation related to technical issues. I will assume (as does Wakefield) that the reader is familiar with the terminology of Bayesian statistics2 as well as with that of conventional (frequentist) statistics.

I will start with a comment on interpretation of P-values, and then turn to the issue of multiple comparisons.

Proper interpretation of P-values and confidence intervals

As Wakefield mentions, for continuous distributions a valid frequentist P-value is uniformly distributed under the null hypothesis. From this fact, we can deduce that, under the null and assuming no bias, 0.3% of the P-values will be 0.003 or less over repeated sampling. When citing Sellke et al.3 Wakefield might seem to claim that this deduction is incorrect. But the deduction is correct, whether or not we happen to observe P = 0.003, since it refers to repeated sampling under the null, not to the observed data set alone.

To sort out the issues of concern it helps to keep P-values distinct from α-levels, for the two are very different concepts and arise from entirely different schools of frequentist inference. The P-value originated several centuries ago (albeit under other names4,5) and only later became the centerpiece of R.A. Fisher's significance testing approach, which entails no mathematical decision rule.6–9 In contrast, the α-level is the maximum allowable Type I error rate for a test; it originated in the 1920s in Neyman and Pearson's decision-theoretic approach to statistical hypothesis testing.7–10

Among valid concerns underlying Wakefield's comment is that we might make claims about a P-value that was picked from multiple ones based on its magnitude (e.g. the smallest P observed), or alter our decision rule upon seeing the P-value (e.g. switch from a 0.05 α-level to a 0.10 α-level if and only if 0.05 < P < 0.10). Such misuse seems invited when someone reports a P-value result by saying the observation is ‘significant at the P-level’, as Fisher sometimes did (albeit he began this practice before Neyman and Pearson's work appeared).4–6

But α-levels and P-values are so closely related mathematically that their confusion seems inevitable, especially because both have been called ‘significance levels’ and ‘critical levels’ by authors who know the distinction perfectly well. In his classic book on statistical testing,10 Erich Lehmann laid out their correct relation within the Neyman–Pearson theory. I think the following points provide an accurate summary of his more symbolic prose:

  1. Imagine we are considering an unlimited sequence of randomized studies (randomized experiments or random-sample surveys) that generate a corresponding sequence of data sets. From each of those we may compute a decision (‘hypothesis test’).

  2. Once we settle on a decision (‘testing’) method appropriate to the study designs in the sequence, each α-level will define a different decision rule for ‘accepting’ or ‘rejecting’ a given test hypothesis. We thus have an infinite number of decision rules, one for each alpha.

  3. All these rules are defined without reference to any particular data set in the sequence.

  4. We can take a given data set and see what each decision rule would yield from that data set.

  5. The P-value from a given data set is the boundary between the α-rules that ‘accept’ and the α-rules that ‘reject’ the test hypothesis when they are fed this data set.

The last point is crucial. In fact, unlike Fisher and many other authors since,1,3,7–9 Lehmann defines the P-value as this boundary alpha10 (p. 70). Yet, I fear Wakefield1 and Sellke et al.3 could be read as if they were making the absurd claim that Lehmann's final point is wrong. True, if we pick our P-value from a larger set of P-values, or change our alpha based on the given data, we can invalidate our tests in a spectacular fashion. But the above points are just part of Neyman–Pearson hypothesis-testing theory and entail no mistake.
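To make Lehmann's boundary point concrete, here is a minimal numerical sketch (mine, not Lehmann's; the Z-value is hypothetical): for a single data set, every α-level rule with α at or above the two-sided P-value rejects, and every rule with α below it accepts.

```python
# Sketch: for one observed Z, the two-sided P-value is the boundary alpha
# separating the alpha-level rules that reject from those that accept.
from scipy.stats import norm

z_obs = 2.17                                  # hypothetical Wald statistic from one data set
p_value = 2 * norm.sf(abs(z_obs))             # two-sided P, roughly 0.030

for alpha in [0.001, 0.01, 0.025, 0.03, 0.05, 0.10]:
    critical = norm.ppf(1 - alpha / 2)        # rejection cutoff for this alpha-rule
    decision = "reject" if abs(z_obs) >= critical else "accept"
    # Every rule with alpha >= p_value rejects; every rule with alpha < p_value accepts.
    print(f"alpha={alpha:.3f}: {decision}  (agrees with 'alpha >= P': {alpha >= p_value})")
```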

The difficulty of creating clear and accurate verbal descriptions of frequentist logic is yet another argument (among others2) for teaching the comparatively straightforward logic of Bayesian statistics, even if one argues that the latter should be calibrated via frequency calculations.1,11,12

Multiple-comparisons problems are real—and common

In the late 20th century, a sharp division emerged between epidemiologists who condemned all multiple-comparison procedures and statisticians who simply carried on with conventional methods such as Bonferroni adjustments. Both sides made legitimate criticisms of the other's stance, but offered unsatisfactory solutions in return. These issues were brought to the fore by an extended debate in the 1990s.13–21

Suppose one were focused on estimating a single effect, as in a systematic review of β-carotene intake and lung cancer. Some epidemiologists argued that, in principle, it would be nonsensical to worry about what else was examined in the studies, and thus denied the need for multiple-comparison procedures.13,16 After all, the argument went, the observed association could not depend on whether one looked at other associations.

In reality, careful study of other associations in the data could lead to alterations in confounding control, and in this way alter the estimated β-carotene effect. Furthermore, if one saw the relation of smoking to lung cancer was weak in one of the data sets, it might suggest serious error in the study and reduce the credibility of the estimated β-carotene effect from that study. Continued reasoning along these lines could lead one to employ methods for simultaneous estimation of effects such as shrinkage or Bayes procedures, even if one were ultimately concerned with just one effect and considered the rest only as potential confounders.22,23

More telling is that the argument against considering whether multiple comparisons were done assumes that what is reported in the literature is an unbiased sample of all that was examined in all the studies. This assumption ignores the selection biases generated by typical uses of conventional statistics.15–17,22,24–26 The classic such problem is publication bias. A study team with many possible comparisons might present details only for ‘interesting’ findings. In particular, it appears that study teams more often publish results on associations if they are ‘statistically significant’ (P < 0.05)25,26 unless there are countervailing forces affecting publication decisions.27 If the P-value is a factor in this selection process, the result is a bias away from the null among the results available for inclusion in reviews, whether one considers results across studies (as in meta-analysis) or results within a study (as in genome scans).26 This bias is of special concern in studies that collect data on many variables, whether lifestyle, occupational or genetic.

Publication bias can thus be seen as a variant of the same selection problem that arises in multiple comparisons. Furthermore, even if one focuses on a single association across studies, the publication-bias problem would be aggravated by conventional multiple-comparison adjustments within studies if those studies publish only significant associations. Making a Bonferroni or other conventional correction within studies in effect decreases the α-level required for presentation, and thus further reduces (in a biased fashion) the amount and representativeness of the published information available to reviewers. In other words, the conventional cure for multiple-comparison problems only makes the underlying problem worse (not unlike more than a few conventional medical cures).

Intuition about such problems may be aided by seeing their relation to regression to the mean. As is well known in screening, if we pick out values from a group based on their extremity (whether ‘extremity’ is based on absolute distance, Z-value or P-value) we will usually find our selections fall back (regress) toward the group mean on the next measurement or study. Further, this regression gets bigger as the initial value gets further from the mean. In typical settings with multiple comparisons, extremely significant associations tend to arise more from extreme errors than from extreme effects, even if the only source of error is random.14 Regression to the mean increases if one uses a more stringent (extreme) selection cut-off, which is what conventional multiple-comparison adjustments do. Add on study validity problems, which should be of paramount concern in observational settings, and conventional multiple-comparison adjustments look like nothing more than added distortion.16,20
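A small simulation sketch (my own illustration with arbitrary numbers, not a result from the literature) shows this selection effect at work: when the most ‘significant’ of many associations is singled out, its replication estimate is typically much closer to the mean than the original.

```python
# Sketch: selecting the most extreme of many noisy estimates produces
# regression toward the mean on replication (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
n_assoc, n_sims = 1000, 2000
true_effects = rng.normal(0.0, 0.10, size=n_assoc)   # mostly near-null true log odds ratios
se = 0.25                                             # common standard error (assumed)

shrinkage = []
for _ in range(n_sims):
    est1 = true_effects + rng.normal(0.0, se, size=n_assoc)   # initial study
    est2 = true_effects + rng.normal(0.0, se, size=n_assoc)   # replication
    winner = np.argmax(np.abs(est1) / se)                     # most 'significant' association
    shrinkage.append(abs(est2[winner]) / abs(est1[winner]))

print(f"median replication/initial ratio for the selected association: "
      f"{np.median(shrinkage):.2f}")   # well below 1: the winner regresses toward the mean
```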

Defensible methods for association selection

One solution to the reporting problem is to provide detailed information on all associations computable from the data. The general infeasibility of this solution is readily apparent: it would not be unusual for a dietary cohort study to have data on 100 food items, 40 food constituents and 30 disease outcomes, resulting in 120 000 associations if we allow only one version of each association. Occupational and medical database studies can generate even more associations. Even if complete presentation were feasible, it would drastically curtail development of careers dependent on the number of publications (and so would not be adopted with enthusiasm), and it would not at all address the need to decide which of these associations are worthy of further study.

Thus, some sort of procedure is needed to pick out associations for further attention. If an association has already been called to attention by previous work, it should easily make the ‘short list’ for reporting, because such previous work needs to be checked and the results of that check need to be reported. But what of the remaining associations, those we consider probably null? As Wakefield emphasizes, the P-value alone (with or without ‘correction’) is deficient for selecting among these probably null candidates: it not only ignores Bayesian parameters (prior probabilities), but ignores frequentist parameters (power and precision) as well. In particular, selection based on conventional multiple testing such as Bonferroni adjustment is analogous to choosing a screening test for disease only to ensure a low frequency of false positives (Type I errors) among the non-diseased (high specificity), without regard to false negatives (Type II errors) or background prevalence (prior probability). It should be no surprise that such a deficient criterion leads to poor selection performance.
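The screening-test analogy can be made numerical with a simple sketch (the prior probabilities and powers below are assumptions for illustration, not estimates from any study): at a fixed α, and hence fixed specificity, the probability that a flagged association is real still varies enormously with prior probability and power, which is exactly what a specificity-only criterion ignores.

```python
# Sketch: treating a significance cut-off as a screening test for real associations.
# Specificity alone (1 - alpha) does not determine how trustworthy a 'positive' is:
# the positive predictive value also depends on power (sensitivity) and on the
# prior probability (prevalence) of a real effect. Numbers are assumptions.

def ppv(prior, power, alpha):
    """P(association is real | flagged at level alpha)."""
    return power * prior / (power * prior + alpha * (1.0 - prior))

alpha = 0.05
for prior in (0.5, 0.01, 0.001):
    for power in (0.8, 0.2):
        print(f"prior={prior:<6} power={power}: PPV={ppv(prior, power, alpha):.3f}")
# With the same alpha (specificity), the PPV ranges from about 0.94 down to about
# 0.004, which is why a criterion based on alpha alone performs poorly for selection.
```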

Frequentist empirical-Bayes methods and their Bayesian counterparts have been recommended for over three decades as replacements for conventional multiple-testing adjustments and variable selection in epidemiology.14,22–24,28–33 These methods incorporate all the parameters relevant for selection, and perform demonstrably better than conventional adjustments for the actual purposes of most studies. In particular, conventional methods (like Bonferroni procedures) test a joint hypothesis that all the associations examined are null, under the assumption that false discovery is the worst error one can commit. More often, however, the goal is to identify or rank associations for further pursuit. Some Bayesian and empirical-Bayesian methods rank associations based on a summary of their posterior distribution, such as their posterior means or a posterior probability of exceeding a given size. These approaches seem natural, given that the methods account for anticipated regression to the mean by shrinking conventional estimates toward their prior mean,29–32 although for hypothesis screening it is arguably better to estimate the rankings directly.34
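As a rough illustration of the shrinkage idea (a minimal normal-normal sketch with invented numbers, not one of the cited procedures), an empirical-Bayes estimator pulls each conventional estimate toward the estimated prior mean by an amount that grows with its sampling variance, which is how it compensates in advance for regression to the mean.

```python
# Sketch of simple normal-normal empirical-Bayes shrinkage of many log odds
# ratio estimates toward a common mean (crude method-of-moments prior estimates).
# The input numbers are hypothetical.
import numpy as np

beta_hat = np.array([1.10, 0.05, -0.30, 0.80, 0.20, -0.60])   # conventional estimates
v = np.array([0.40, 0.10, 0.20, 0.35, 0.05, 0.30])            # their sampling variances

mu_hat = np.average(beta_hat, weights=1.0 / v)                 # estimated prior mean
# Crude method-of-moments estimate of the prior (between-association) variance W:
w_hat = max(np.mean((beta_hat - mu_hat) ** 2 - v), 0.0)

shrink = w_hat / (w_hat + v)                                   # shrinkage factor per estimate
posterior_mean = mu_hat + shrink * (beta_hat - mu_hat)
posterior_var = shrink * v                                     # approximate posterior variance

for b, pm in zip(beta_hat, posterior_mean):
    print(f"conventional {b:+.2f} -> shrunken {pm:+.2f}")
# Imprecise estimates (large v) are pulled hardest toward the prior mean; rankings
# can then be based on posterior_mean or on posterior tail probabilities.
```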

Wakefield's approach is one of a family of methods based on Bayes factors instead of direct posterior summaries (for brevity I will leave aside his use of q-values). Like Bayesian analysis via weighted averaging,2 his approximate Bayes factor (ABF) employs a normal approximation based on the usual (maximum-likelihood) estimate of the log odds ratio θ and its variance estimate V. Wakefield justifies focusing on the Bayes factor rather than posterior probabilities by assuming the log odds ratios are exchangeable and, in particular, share the same normal prior with mean zero and variance W. This assumption allows one to eliminate prior details when ranking associations based on their posterior odds. With his assumption, Wakefield finds
ABF = √(T/V) exp(−Z²W/(2T)),     (1)
where T = V + W is the estimated total (pre-data predictive) variance for the estimated log odds ratio, and Z is the usual Wald statistic (estimate/standard error) for testing the null.
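For readers who wish to compute this quantity, the following minimal sketch evaluates formula (1) for a single association (the estimate, its variance V and the prior variance W are invented numbers; with the formula written, as here, as the ratio of the approximate likelihood under the null to that under the N(0, W) alternative, small values favour the alternative).

```python
# Sketch: Wakefield-style approximate Bayes factor from a log odds ratio
# estimate and its variance, assuming a normal(0, W) prior on the log odds
# ratio. Written as null/alternative, so small ABF favours the alternative.
# The numerical inputs are hypothetical.
import math

def abf(theta_hat, v, w):
    """Approximate Bayes factor for one association (normal approximation)."""
    t = v + w                     # total (pre-data predictive) variance
    z2 = theta_hat ** 2 / v       # squared Wald statistic
    return math.sqrt(t / v) * math.exp(-z2 * w / (2.0 * t))

theta_hat = math.log(1.5)         # estimated log odds ratio (hypothetical)
v = 0.04                          # its estimated sampling variance (hypothetical)
w = 0.21                          # prior variance W for the log odds ratio (hypothetical)
print(f"Z = {theta_hat / math.sqrt(v):.2f},  ABF = {abf(theta_hat, v, w):.3f}")
```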

Wakefield studies the performance of this ABF in hypothesis screening (‘fishing expeditions’), especially in relation to ranking by P-values. The method and study seem suitable for the context he describes (ranking a priori exchangeable associations). Nonetheless, I have reservations about the use of Bayes factors (as opposed to rankings based on posterior summaries or direct rank estimation) in the more general setting of only partial exchangeability. At the very least, regressing out sources of non-exchangeability and operating with exchangeable residuals30,33 would seem essential, but is not addressed by Wakefield's current method (understandably, given his examples do not appear to require such refinements).

P-values = bad Bayes factors

Just after introducing his first ABF (formula 1 above), Wakefield describes a second ABF that provides rankings identical to the P-values from the chosen Wald statistic (the Z-value). It does so by using a different prior for each log odds ratio, assuming the prior variance W is proportional to the estimated sampling variance V, i.e. W = K × V for some constant K (which is a data-driven and hence empirical-Bayesian assumption). Wakefield's second approximate Bayes factor suffers from the major problem of P-values that his first ABF was developed to address—i.e. its rankings do not depend on power or precision. It also reveals another reason that P-values are a poor choice for ranking: they implicitly correspond to implausible prior distributions.
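The claim that this second ABF ranks exactly as the P-value does can be checked directly: substituting W = K × V into formula (1) gives √(1 + K) exp[−Z²K/(2(1 + K))], which depends on the data only through Z. The short sketch below (with invented estimates, all sharing Z = 2) verifies this numerically.

```python
# Sketch: with prior variance W = K*V, the approximate Bayes factor reduces to
# sqrt(1+K) * exp(-Z**2 * K / (2*(1+K))), a function of Z alone, so it ranks
# associations exactly as the P-value does, regardless of precision V.
import math

def abf(theta_hat, v, w):
    t = v + w
    return math.sqrt(t / v) * math.exp(-(theta_hat ** 2 / v) * w / (2.0 * t))

K = 4.0
for theta_hat, v in [(0.40, 0.04), (2.00, 1.00), (0.04, 0.0004)]:   # all have Z = 2
    print(f"V={v:<8} Z={theta_hat / math.sqrt(v):.1f}  ABF={abf(theta_hat, v, K * v):.4f}")
# Identical ABFs despite very different precision: the second ABF, like the
# P-value, cannot distinguish a precise estimate from an imprecise one.
```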

It is conceivable that our prior variances for log odds ratios would expand as the exposures became more uncommon or the actual log odds ratios moved away from the null. Nonetheless, there is no reason to think that the prior variances would increase at a rate miraculously proportional to the estimated sampling variance. It might be tempting to think so based on confusion of actual measures with their estimates (log odds ratio estimates based on uncommon exposures do tend to have larger sampling variances). But any argument based on the size of the actual effects has to take account of the fact that effect-size rankings depend on the measure of effect. In particular, rankings can be reversed by choosing a difference instead of a ratio.35–37 Complicating matters further, estimated rankings at the far extremes can differ for different effect measures even if we first divide the estimates by their approximate standard errors (Appendix).

Conclusion

The emergence of genetic epidemiology, with its staggering number of associations to explore, has brought multiple-inference concepts into the mainstream of epidemiology and biostatistics.1,12 It is thus time to recognize the extent of multiple-comparison problems in everyday epidemiology and to deploy modern methods toward their resolution.

Acknowledgements

The author is grateful to Charles Poole, Katherine Hoggatt and Jon Wakefield for helpful comments.

Conflict of interest: None declared.

References

1 Wakefield J. Reporting and interpretation in genome-wide association studies. Int J Epidemiol 2008;37:641–53.
2 Greenland S. Bayesian perspectives for epidemiologic research. I. Foundations and basic methods (with comment and reply). Int J Epidemiol 2006;35:765–78.
3 Sellke T, Bayarri MJ, Berger JO. Calibration of P values for testing precise null hypotheses. Am Stat 2001;55:62–71.
4 David HA. First (?) occurrence of common terms in mathematical statistics. American Statistician 1995;49:121–33.
5 Stigler S. The History of Statistics. Cambridge, MA: Belknap Press, 1986.
6 Fisher RA. Statistical Methods for Research Workers. 4th edn. Edinburgh: Oliver & Boyd, 1932.
7 Goodman SN. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol 1993;137:485–96.
8 Gigerenzer G. Mindless statistics. Journal of Socioeconomics 2004;33:567–606.
9 Rothman KJ, Greenland S, Lash TL. Precision and statistics in epidemiologic studies, Chapter 10. In: Rothman KJ, Greenland S, Lash TL (eds). Modern Epidemiology. 3rd edn. Philadelphia: Lippincott Williams & Wilkins, 2008, pp. 151–56.
10 Lehmann EL. Testing Statistical Hypotheses. 2nd edn. New York: Wiley, 1986.
11 Rubin DB. Bayesianly justifiable and relevant frequency calculations. Ann Stat 1984;12:1151–72.
12 Efron B. Bayesians, frequentists, and scientists. J Am Stat Assoc 2005;100:1–5.
13 Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology 1990;1:43–46.
14 Greenland S, Robins JM. Empirical-Bayes adjustments for multiple comparisons are sometimes useful. Epidemiology 1991;2:244–51.
15 Poole C. Multiple comparisons? No problem! (editorial). Epidemiology 1991;2:241–42.
16 Savitz DA, Olshan AF. Multiple comparisons and related issues in epidemiologic research. Am J Epidemiol 1995;142:904–8.
17 Manor O, Peritz E. Re: “Multiple comparisons and related issues in epidemiologic research” (letter with reply). Am J Epidemiol 1997;145:84–85.
18 Thompson J. Re: “Multiple comparisons and related issues in the interpretation of epidemiologic data.” Am J Epidemiol 1998;147:801–6.
19 Goodman SN. Multiple comparisons, explained. Am J Epidemiol 1998;147:807–12.
20 Savitz DA, Olshan AF. Describing data requires no adjustment for multiple comparisons: a reply from Savitz and Olshan. Am J Epidemiol 1998;147:813–14.
21 Thompson J. A response to ‘Describing data requires no adjustment for multiple comparisons.’ Am J Epidemiol 1998;147:815.
22 Greenland S. Variable selection and shrinkage in the control of multiple confounders. Am J Epidemiol 2008;167:523–29.
23 Greenland S. Bayesian methods for epidemiologic research. II. Regression analysis. Int J Epidemiol 2007;36:195–202.
24 Greenland S, Rothman KJ. Fundamentals of epidemiologic data analysis, Chapter 13. In: Rothman KJ, Greenland S, Lash TL (eds). Modern Epidemiology. 3rd edn. Philadelphia: Lippincott Williams & Wilkins, 2008, pp. 234–37.
25 Sterling TD. Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. J Am Stat Assoc 1959;54:30–34.
26 Phillips CV. Publication bias in situ. BMC Med Res Method 2004;4:20. Available at http://www.biomedcentral.com/1471-2288/4/20 (Accessed July 29, 2007).
27 Greenland S. Commentary: On ‘Quality in epidemiological research: should we be submitting papers before we have the results and submitting more hypothesis generating research?’ Int J Epidemiol 2007;36:944–45.
28 Efron B, Morris CN. Data analysis using Stein's estimator and its generalizations. J Am Stat Assoc 1975;70:311–19.
29 Thomas DC, Semiatycki J, Dewar R, Robins J, Goldberg M, Armstrong BG. The problem of multiple inference in studies designed to generate hypotheses. Am J Epidemiol 1985;122:1080–95.
30 Greenland S. Methods for epidemiologic analyses of multiple exposures: a review and comparative study of maximum likelihood, preliminary testing, and empirical-Bayes regression. Stat Med 1993;12:717–36.
31 Greenland S, Poole C. Empirical Bayes and semi-Bayes approaches to occupational and environmental hazard surveillance. Arch Environ Health 1994;49:9–16.
32 Steenland K, Bray I, Greenland S, Boffetta P. Empirical-Bayes adjustments for occupational surveillance analysis. Cancer Epidemiol Biomarkers Prev 2000;9:895–903.
33 Greenland S. When should epidemiologic regressions use random coefficients? Biometrics 2000;56:915–21.
34 Carlin B, Louis TA. Bayes and Empirical-Bayes Methods of Data Analysis. 2nd edn. New York: Chapman and Hall, 2000.
35 Berkson J. Smoking and lung cancer: some observations on two recent reports. J Am Stat Assoc 1958;53:28–38.
36 Greenland S, Rothman KJ, Lash TL. Measures of effect and measures of association, Chapter 4. In: Rothman KJ, Greenland S, Lash TL (eds). Modern Epidemiology. 3rd edn. Philadelphia: Lippincott Williams & Wilkins, 2008, p. 62.
37 Brumback BA, Berg A. On effect-measure modification: relations among changes in the relative risk, odds ratio, and risk difference. Stat Med 2008; in press.
38 Greenland S. A counterexample to the test-based principle of setting confidence limits. Am J Epidemiol 1984;120:4–7.
39 Vaeth M. On the use of Wald's test in exponential families. Int Statist Rev 1985;54:199–214.
40 Greenland S, Rothman KJ. Introduction to categorical statistics, Chapter 14. In: Rothman KJ, Greenland S, Lash TL (eds). Modern Epidemiology. 3rd edn. Philadelphia: Lippincott Williams & Wilkins, 2008.

Appendix

Different effect measures can provide different rankings of effect size.35–37 Dividing the estimated measures by their standard errors does not always remove this phenomenon and can even cause it. To see the latter possibility, compare the simple numerical examples in Table A1. Both the risk difference and the risk ratio increase going from panel 1 to panel 2, as does the Z-score (null Wald statistic) from the risk difference. But the Z-score from the log risk ratio declines. The same possibility exists for rare diseases and when using rate ratios or odds ratios in place of risk ratios. These examples reflect the fact that different asymptotic statistics are not equivalent except in neighbourhoods of the test hypothesis.38,39 Other examples can be constructed in which the estimates agree with their own Z-score rankings, but the rankings differ for the different measures and their Z-scores.

Table A1
Hypothetical cohort example in which changing to the Z-score (estimate over standard error) reverses the ranking when using the risk ratio but not when using the risk difference, when standard errors are computed using conventional large-sample formulas40 (p. 249)

                      Panel 1                      Panel 2
                Exposed   Unexposed          Exposed   Unexposed
Cases              10          4                10          2
Total              20         80                20         80

RD, Za          0.450, 3.93                  0.475, 4.20
RR, Zb          10, 4.12                     20, 4.00

aRD = Risk difference estimate; Z = RD divided by its standard error.
bRR = Risk ratio estimate; Z = ln(RR) divided by its standard error.
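The panel computations can be reproduced with the short sketch below, which uses the usual large-sample variance formulas; depending on the exact variance formula used, the risk-ratio Z-values may differ slightly in the decimals from those shown above, but the qualitative reversal (the risk-difference Z rises from panel 1 to panel 2 while the risk-ratio Z falls) is reproduced.

```python
# Sketch: RD, RR and their Wald Z-scores for the two hypothetical panels,
# using standard large-sample variance formulas. Decimals may differ slightly
# from Table A1 depending on the variance formula, but the ranking reversal
# for the risk-ratio Z is reproduced.
import math

def summarize(a1, n1, a0, n0):
    p1, p0 = a1 / n1, a0 / n0
    rd = p1 - p0
    rd_se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    rr = p1 / p0
    lnrr_se = math.sqrt(1 / a1 - 1 / n1 + 1 / a0 - 1 / n0)
    return rd, rd / rd_se, rr, math.log(rr) / lnrr_se

for label, cells in [("panel 1", (10, 20, 4, 80)), ("panel 2", (10, 20, 2, 80))]:
    rd, z_rd, rr, z_rr = summarize(*cells)
    print(f"{label}: RD={rd:.3f} (Z={z_rd:.2f}), RR={rr:.0f} (Z={z_rr:.2f})")
# RD and its Z both increase from panel 1 to panel 2, while RR increases
# but its Z decreases.
```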
