## Abstract

Genetic effects for common variants affecting complex disease risk are subtle. Single genome-wide association (GWA) studies are typically underpowered to detect these effects, and combination of several GWA data sets is needed to enhance discovery. The authors investigated the properties of the discovery process in simulated cumulative meta-analyses of GWA study-derived signals allowing for potential genetic model misspecification and between-study heterogeneity. Variants with null effects on average (but also between-data set heterogeneity) could yield false-positive associations with seemingly homogeneous effects. Random effects had higher than appropriate false-positive rates when there were few data sets. The log-additive model had the lowest false-positive rate. Under heterogeneity, random-effects meta-analyses of 2–10 data sets averaging 1,000 cases/1,000 controls each did not increase power, or the meta-analysis was even less powerful than a single study (*power desert*). Upward bias in effect estimates and underestimation of between-study heterogeneity were common. Fixed-effects calculations avoided power deserts and maximized discovery of association signals at the expense of much higher false-positive rates. Therefore, random- and fixed-effects models are preferable for different purposes (fixed effects for initial screenings, random effects for generalizability applications). These results may have broader implications for the design and interpretation of large-scale multiteam collaborative studies discovering common gene variants.

Genome-wide association (GWA) platforms have caused a paradigm shift in the discovery of gene variants associated with complex traits (1). Many statistically robust associations have been identified. Nevertheless, with few exceptions, most newly discovered variants have subtle effects and explain only a small percentage of the risk variance for complex traits. To improve power, collaborative studies and collaborative meta-analyses (joint analyses) of many data sets have become essential (2–6). These approaches have already been successful in identifying new variants for type 2 diabetes (2), height (5), lipid levels (7), colorectal cancer (8), and rheumatoid arthritis (9). Besides increasing power (10), meta-analysis may also help derive more precise estimates of effects. However, there are some caveats in the discovery of associations through meta-analysis of several data sets. First, as in single studies pursuing agnostic associations without any prior biologic or functional insight, the true genetic mode of action is unknown (11–13). The impact of model misspecification has been examined in single studies (14, 15) but not in meta-analyses. Second, when many data sets are combined, an array of reasons may cause between-study heterogeneity in the genetic effect sizes. Heterogeneity increases the sample sizes required to document associations (16) and may thus impact also the discovery process in meta-analyses of genome-wide signals.

Here, we used simulations to investigate the properties of the discovery process in cumulative meta-analyses of GWA study-derived signals. Our focus is on meta-analyses aimed at discovering new markers associated with complex traits rather than meta-analyses aimed at combining replication studies of an already reported GWA finding. We addressed the situation where data sets accumulate over time on agnostic genome-wide associations for which there is no prior knowledge about the genetic model of inheritance or between-study heterogeneity. We probed the implications of different types of model misspecification and between-study heterogeneity for various genetic effect sizes and minor allele frequencies, when a variable number of data sets have been combined.

## MATERIALS AND METHODS

### Simulation strategy and assumptions

We use the term *true model* to refer to the particular genetic model of inheritance that is considered to be the true one and that underlies each set of simulations, as well as the term *model of analysis* to refer to each of the different models that can be used for analysis, regardless of whether it is the true model or not (*misspecified models*).

We conducted simulations to investigate the discovery properties of a biallelic, single nucleotide polymorphism within a GWA setting generated under 4 potential true models: dominant, log additive (per allele), recessive, and a null model, in which the genetic variant has no effect on trait risk (odds ratio (OR) = 1.0). We similarly used 3 main models of analysis: dominant, log additive, and recessive. Modeling for linear additive inheritance (not shown here) yielded qualitatively very similar behavior to the log-additive models. We coded outcomes so that the odds ratio for the nonnull models would be above 1 for the minor allele.
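As an illustration of how a model of analysis maps onto the data, the sketch below collapses genotype counts into a 2 × 2 table for the dominant and recessive models and computes a log odds ratio with Woolf's variance. The function name `collapsed_log_or`, the genotype ordering (aa, aA, AA), and the 0.5 continuity correction are assumptions made for this example. The text does not state which estimator was used, and the log-additive (per-allele) model would typically need a logistic regression or trend test that is not shown here.

```python
import math


def collapsed_log_or(cases, controls, model):
    """Log OR and Woolf variance from genotype counts ordered (aa, aA, AA).

    model: "dominant"  -> carriers of the minor allele (aa + aA) vs. AA
           "recessive" -> minor-allele homozygotes (aa) vs. (aA + AA)
    Hypothetical helper for illustration only.
    """
    aa1, het1, ref1 = cases
    aa0, het0, ref0 = controls
    if model == "dominant":
        a, b, c, d = aa1 + het1, ref1, aa0 + het0, ref0
    elif model == "recessive":
        a, b, c, d = aa1, het1 + ref1, aa0, het0 + ref0
    else:
        raise ValueError("model must be 'dominant' or 'recessive'")
    cells = [a, b, c, d]
    if min(cells) == 0:
        # Assumed continuity correction, applied only when a cell is empty.
        cells = [x + 0.5 for x in cells]
    a, b, c, d = cells
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


log_or, variance = collapsed_log_or((200, 500, 300), (160, 480, 360), "dominant")
print(round(math.exp(log_or), 4), round(variance, 5))
```

The per-study log odds ratios and variances returned this way are the inputs that an inverse-variance meta-analysis would then combine.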

Because most recent GWA and replication investigations have used case-control designs, each data set was simulated as such, assuming an equal number of cases and controls. The single nucleotide polymorphism was considered to have alleles *a* and *A* with population frequencies *f* and 1 − *f*, respectively, in which *a* is the minor frequency allele. In each study, we first generated controls assuming Hardy-Weinberg equilibrium and a multinomial distribution with probabilities, *P*_{control} = (*f*^{2}, 2*f* (1 − *f*), (1 − *f*)^{2}). Cases were then simulated assuming a multinomial distribution with probabilities,

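$$
P_{\text{case}} = \frac{\left(f^{2}e^{\theta_{aa}},\; 2f(1-f)e^{\theta_{aA}},\; (1-f)^{2}\right)}{f^{2}e^{\theta_{aa}} + 2f(1-f)e^{\theta_{aA}} + (1-f)^{2}},
$$

where θ_{aA} and θ_{aa} are the study-specific logarithms of odds ratio (log OR) for heterozygotes and homozygotes for the susceptibility allele, respectively, compared with the wild-type homozygotes. We generated study-specific effects θ_{aA} and θ_{aa} from a 2-dimensional normal distribution,

$$
\begin{pmatrix}\theta_{aA}\\ \theta_{aa}\end{pmatrix} \sim N\!\left(\begin{pmatrix}\mu_{aA}\\ \mu_{aa}\end{pmatrix},\; \begin{pmatrix}\tau_{aA}^{2} & \rho\sqrt{c}\,\tau_{aA}^{2}\\ \rho\sqrt{c}\,\tau_{aA}^{2} & c\,\tau_{aA}^{2}\end{pmatrix}\right), \qquad \mu_{aA} = \lambda\,\mu_{aa},
$$

where μ_{aA} and μ_{aa} are the true population log odds ratios, τ_{aA}^{2} is the between-studies variance, and ρ is a correlation parameter specific to the true underlying genetic model: λ = *c* = ρ = 1 for the dominant model; ρ = 1, λ = 0.5, and *c* = 4 for the log-additive model; λ = ρ = 0 and *c* = 1 for the recessive model; and λ = μ_{aa} = ρ = 0 and *c* = 1 for the null model.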
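The data-generating step described above can be sketched as follows. This is a minimal illustration, not the authors' simulation engine. It assumes that λ scales the homozygote mean to give the heterozygote mean (μ_{aA} = λμ_{aa}), that *c* scales the heterozygote variance to give the homozygote variance, and that case genotype probabilities are the Hardy-Weinberg probabilities reweighted by the study-specific odds ratios. All function names are hypothetical.

```python
import math
import random


def draw_study_effects(mu_aa, tau2_het, lam, c, rho, rng):
    """Draw (theta_aA, theta_aa) from a bivariate normal via two standard normals."""
    sd_het = math.sqrt(tau2_het)
    sd_hom = math.sqrt(c * tau2_het)       # assumed: Var(theta_aa) = c * tau2_het
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    theta_het = lam * mu_aa + sd_het * z1  # assumed: mu_aA = lam * mu_aa
    theta_hom = mu_aa + sd_hom * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    return theta_het, theta_hom


def genotype_probs(f, theta_het=0.0, theta_hom=0.0):
    """Probabilities for (aa, aA, AA); zero log ORs give the HWE control distribution."""
    weights = [f * f * math.exp(theta_hom),
               2 * f * (1 - f) * math.exp(theta_het),
               (1 - f) ** 2]
    total = sum(weights)
    return [w / total for w in weights]


def multinomial_counts(n, probs, rng):
    """Multinomial draw of n subjects over the three genotypes."""
    counts = [0, 0, 0]
    for g in rng.choices(range(3), weights=probs, k=n):
        counts[g] += 1
    return counts


def simulate_study(f, mu_aa, tau2_het, lam, c, rho, rng):
    """One case-control data set with equal numbers of cases and controls."""
    n_total = rng.randint(1000, 3000)      # assumed discrete U(1,000, 3,000) subjects
    n = n_total // 2
    theta_het, theta_hom = draw_study_effects(mu_aa, tau2_het, lam, c, rho, rng)
    controls = multinomial_counts(n, genotype_probs(f), rng)
    cases = multinomial_counts(n, genotype_probs(f, theta_het, theta_hom), rng)
    return cases, controls


rng = random.Random(2024)
# Log-additive true model, per-allele OR = 1.3, f = 0.4, tau^2 = 0.05.
cases, controls = simulate_study(0.4, 2 * math.log(1.3), 0.05, 0.5, 4, 1.0, rng)
print(cases, controls)
```

With ρ = 1 and *c* = 4, the homozygote log OR in each study is exactly twice the heterozygote log OR, which is what a per-allele (log-additive) model implies.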

### Simulation parameters

Based on the assumptions described above, the combination of large-scale genetic data sets in a meta-analysis can be characterized by the following main parameters: odds ratios of susceptibility alleles (μ_{aA} and μ_{aa}), population allelic frequencies (*f*), individual study sample sizes, number of combined data sets, and between-study variance (τ^{2}). Parameters for each scenario were chosen to be in a plausible range of values, consistent with insights from recent studies (1–3, 6, 8). Besides the null odds ratio (OR = 1.0), we investigated small (OR = 1.1), modest (OR = 1.3), and high-magnitude (OR = 2.0) effects with *f* of 0.1 and 0.4. We did not deal here with very uncommon or rare variants that may have large odds ratios, since this is a situation that is often postulated but hard to demonstrate conclusively (among the rare variants, the only ones detectable would be those with large effects anyhow, while infinitesimal contributions to the risk variance would be impossible to discover).

We investigated cumulative results of up to 30 data sets, where the number of subjects per individual data set was simulated as ∼*U* (1,000, 3,000), considering an equal number of cases and controls. For each combination of these parameters, we carried out simulations both with and without heterogeneity assumptions. Heterogeneity across genome-wide data sets may be due to a differential linkage disequilibrium across samples of the measured polymorphism against the true culprit one, phenotypic heterogeneity/misclassification, differential environmental exposures that interact with genetic variance to influence phenotypes across data sets, gene-gene interactions, or any other reasons that may genuinely modify a genetic effect or introduce differential bias in different data sets (17). All these potential contributors to fluctuations in the estimated effect sizes across data sets were modeled as a single measure of between-study variance, τ^{2}, because we were ultimately interested in the global effect of heterogeneity on bias in estimates, power, and false-positive results, and it would not be straightforward to separate statistically different causes of heterogeneity. We used equally spaced heterogeneity values for τ^{2} (0, 0.025, and 0.05) representing scenarios of homogeneity, moderate heterogeneity, and strong heterogeneity of signals, respectively.

For all analyses, we set the genome-wide significance threshold for claiming discovery at *P* < 10^{−7}. The selected value of α (10^{−7}) is an approximation that should work relatively well in typical studies conducted currently in Caucasian populations. The specific genome-wide threshold may vary depending on the population tested, the sample size of studies performed, and single nucleotide polymorphism selection criteria in the genotyping platform (18).

### False-positive rate, power, and bias in effect size and between-study heterogeneity

False positives (also known as type I error) arise when associations achieve a *P* value below the selected threshold (e.g., <10^{−7} for genome-wide significance) while the true effect is null in all populations (OR = 1.0, *null genetic model*). For the null genetic model, we checked whether the simulations give appropriate type-I error rates for different specified α levels (including 10^{−4} and 10^{−5}; for 10^{−7}, we would expect to see no false positives among only 1,000,000 simulations).

We also considered the possibility that the true average effect across diverse populations is null, but the genetic variants may have small protective effects in some populations and settings and susceptibility effects in others, and the susceptibility effects are as common and as strong as protective effects (*null-average genetic model*). This could arise either with flip-flop associations where genuine effects are observed in opposite directions in different populations or when systematic errors and biases interfere in different studies, but they also act in opposite directions and cancel out across many studies. For the null-average genetic model, the term *false positives* is used for parsimony and convenience to signify those associations that achieve a *P* value below the selected threshold, while the *average* true effect is null. True effects may still be nonnull in specific populations, so the term false positives literally means here “falsely nonnull average effects.”

For scenarios with a truly associated variant, we carried out 10,000 simulations and computed the power, the average bias in estimated effect sizes, and the average bias in the estimated between-study variance. Power for nonnull models is defined as the proportion of simulated results with *P* < 10^{−7}. Selection based on significance thresholds is expected to be accompanied by inflation of the effect size (winner's curse) (19, 20). Bias was calculated as the ratio of the median observed versus the true value; observed values were those seen in the set of meta-analyses with statistically significant signals at *P* < 10^{−7}.
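The power and bias definitions above can be expressed compactly. This is a minimal sketch under stated assumptions: the function name `power_and_bias` and the input layout (one observed value and one *P* value per simulated meta-analysis) are hypothetical. The scale of the observed values is left to the caller, because the text does not say whether the ratio was taken on the odds ratio or the log odds ratio scale.

```python
import statistics

ALPHA = 1e-7  # genome-wide significance threshold used in the text


def power_and_bias(results, true_value, alpha=ALPHA):
    """Return (power, bias) from simulated meta-analyses.

    results: list of (observed_value, p_value) pairs, one per simulation.
    power:   proportion of simulations with p_value < alpha.
    bias:    median observed value among significant results divided by
             the true value; None when nothing is significant.
    """
    significant = [obs for obs, p in results if p < alpha]
    power = len(significant) / len(results)
    if not significant:
        return power, None
    return power, statistics.median(significant) / true_value


demo = [(0.30, 1e-9), (0.40, 1e-8), (0.10, 0.5), (0.20, 1e-3)]
print(power_and_bias(demo, true_value=0.25))
```

Because the bias is computed only among results that cross the threshold, low-power scenarios select the largest estimates, which is the winner's curse mechanism described above.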

### Meta-analysis models

In the main analyses, data sets were combined considering the DerSimonian-Laird random-effects model (21), which assumes that studies may have heterogeneity in their results and incorporates the between-study variance in calculating the summary estimates. We also performed fixed-effects calculations for comparison applying the general inverse variance method. Fixed effects assume that there is a common effect across data sets with no heterogeneity.
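For reference, the two synthesis methods can be sketched as below, using the usual inverse-variance and DerSimonian-Laird moment-estimator formulas and a two-sided test based on the standard normal distribution. The function names are chosen for this example, and the inputs are assumed to be per-study log odds ratios with their variances.

```python
import math


def fixed_effects(estimates, variances):
    """Inverse-variance pooled estimate and its variance."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    return pooled, 1 / sum(weights)


def dersimonian_laird(estimates, variances):
    """Random-effects pooled estimate, its variance, and the estimated tau^2."""
    k = len(estimates)
    weights = [1 / v for v in variances]
    sum_w = sum(weights)
    fe, _ = fixed_effects(estimates, variances)
    # Cochran's Q and the method-of-moments estimate of between-study variance.
    q = sum(w * (y - fe) ** 2 for w, y in zip(weights, estimates))
    scale = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / scale) if scale > 0 else 0.0
    re_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum(re_weights)
    return pooled, 1 / sum(re_weights), tau2


def two_sided_p(estimate, variance):
    """Two-sided P value from the standard normal approximation."""
    z = abs(estimate) / math.sqrt(variance)
    return math.erfc(z / math.sqrt(2))


pooled, var, tau2 = dersimonian_laird([0.1, 0.3], [0.01, 0.01])
print(pooled, var, tau2, two_sided_p(pooled, var))
```

When the estimated τ^{2} is zero, the random-effects weights reduce to the fixed-effects weights and the two methods give the same result.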

## RESULTS

### False-positive rates

All genetic models of analysis under the null model resulted in no false-positive signals after considering 1,000,000 simulations, regardless of the specified genetic model and allelic frequencies, although the detection of very uncommon false positives is limited by the number of performed simulations. In these simulations, the false-positive rate for less stringent values of α (10^{−4} or 10^{−5}) was found to be appropriately estimated, suggesting that simulations have appropriate type I error properties.

Conversely, under the null-average model, false positives (or, to be more exact, falsely nonnull-average effects) were not uncommon (Table 1). There was a substantial inflation in the type I error rates for the standard DerSimonian-Laird method (using an approximation based on the standard normal distribution to test the mean of the distribution of the random effects), when applied at genome-wide thresholds. Depending on the genetic model specification, this method provided false-positive rates on the order of 90 to several thousand times higher than the “expected” nominal rate at α = 10^{−7}. Type I error rates tended to increase with increasing between-study variance, but they decreased rapidly as more studies were added. For example, at the accumulation of 4 data sets, considering a scenario of τ^{2} = 0.05 and a common allele (*f* = 0.4), the rate was 0.32% for the recessive, 0.22% for the dominant, and 0.096% for the log-additive model. At the accumulation of 10 data sets, the rates had decreased 4- to 8-fold, depending on the model. Overall, the log-additive model had the lowest rates when allelic frequency was high. In contrast, the rates were virtually 0% for the recessive model when the allelic frequency was low. With 30 data sets accumulated, the log-additive model also showed false-positive rates of 0.003% or less, even with low allele frequency. When we evaluated the test statistic of the random effects not based on the normal distribution but based on the *t* distribution with *k* − 1 df (as proposed by Follmann and Proschan (22)), the false-positive rates decreased to appropriate levels (0–1 per 1,000,000 simulations regardless of the number of studies).

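Table 1. False-positive rates (%) at α = 10^{−7} under the null-average model with random-effects calculations, by model of analysis, minor allele frequency (*f*), and between-study variance (τ^{2})

| Cumulative No. of Data Sets | Dominant (f = 0.1, τ^{2} = 0.025) | Log Additive (f = 0.1, τ^{2} = 0.025) | Recessive (f = 0.1, τ^{2} = 0.025) | Dominant (f = 0.1, τ^{2} = 0.05) | Log Additive (f = 0.1, τ^{2} = 0.05) | Recessive (f = 0.1, τ^{2} = 0.05) | Dominant (f = 0.4, τ^{2} = 0.025) | Log Additive (f = 0.4, τ^{2} = 0.025) | Recessive (f = 0.4, τ^{2} = 0.025) | Dominant (f = 0.4, τ^{2} = 0.05) | Log Additive (f = 0.4, τ^{2} = 0.05) | Recessive (f = 0.4, τ^{2} = 0.05) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 0.0835 | 0.0501 | 0 | 0.4996 | 0.3330 | 0 | 0.0803 | 0.0233 | 0.1517 | 0.5442 | 0.2185 | 0.8619 |
| 4 | 0.0427 | 0.0258 | 0 | 0.2122 | 0.1553 | 0 | 0.0444 | 0.0146 | 0.0722 | 0.2159 | 0.0965 | 0.3215 |
| 10 | 0.0124 | 0.0097 | 0 | 0.0352 | 0.0300 | 0 | 0.0143 | 0.0068 | 0.0174 | 0.0348 | 0.0241 | 0.0402 |
| 20 | 0.0033 | 0.0032 | 0 | 0.0043 | 0.0040 | 0 | 0.0045 | 0.0019 | 0.0045 | 0.0042 | 0.0038 | 0.0053 |
| 30 | 0.0015 | 0.0016 | 0 | 0.0011 | 0.0015 | 0 | 0.0009 | 0.0007 | 0.0011 | 0.0017 | 0.0027 | 0.0014 |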

Results are based on 1,000,000 simulations.

The false-positive signals based on the standard normal distribution more commonly suggested susceptibility effects than protective effects for the minor allele variant. This is because, with a given minor allele frequency in controls, a more extreme deviation from the null effect is needed to achieve *P* < 10^{−7} for protective associations than for susceptibility associations. Moreover, when such null-average associations were “discovered” (*P* < 10^{−7}) early from the combination of a few data sets, they had no observed between-study heterogeneity: Associations showing *P* < 10^{−7} tended to arise from seemingly highly homogeneous meta-analyses. Practically, these association signals reflect situations where all data sets had been sampled by chance from 1 tail only of the null-average distribution of genetic effects. These discovered associations seemed to even have consistent, homogeneous effects across the tested data sets, while in fact the average population effect is null.

### Genetic model misspecification: loss of power

Figure 1 shows power estimates with accumulation of data sets, when the true odds ratio = 1.3 and *f* = 0.4. The power cost of model misspecification was particularly prominent when the true underlying model was dominant but a recessive mode of inheritance was specified, or vice versa. This sort of extreme genetic model misspecification could cause nearly complete inability to detect small genetic effects, even when dozens of data sets were combined. In the absence of between-study heterogeneity, if the true model were dominant, the log-additive model of analysis lost some power and required 1–3 more data sets to have equivalent power to the dominant model when fewer than 10 studies were combined; with over 10 studies, power was equally maximized with both models (Figure 1A).

With small effect sizes (OR = 1.1) or small minor allele frequency (*f* = 0.1), the inferences were qualitatively similar, except that power was generally much lower. The loss of power with a misspecified log-additive model over a homogeneous, true dominant model became more prominent in small effect sizes. The log-additive model also performed poorly for recessive genetic variants with low allelic frequency, particularly when the genetic effect was large. For example, in a true recessive model with variants conferring an odds ratio of 2.0, *f* = 0.1, and modest between-study heterogeneity (τ^{2} = 0.025), power with a correct specification of the recessive model reached over 90% with 20 combined data sets, while the log-additive model barely achieved 8% power with the same data.

### Genetic model misspecification: situations of undiminished and increased power

Under moderate to high between-study heterogeneity and common genetic variants, the true genetic model was not always more powerful than a misspecified model. In particular, for a true dominant model, misspecification of a log-additive model resulted in virtually no difference or even a slight increase in power estimates (Figure 1, B and C). Similarly, for a true log-additive model, with moderate to high between-study heterogeneity, power was fairly similar for all models of analysis (Figure 1, E and F).

These seemingly counterintuitive findings arose because of upwardly biased effect sizes and/or an underestimation of the between-study variance in the misspecified models. For example, assumption of a log-additive effect instead of the true dominant led to a median underestimation of ∼80% in the estimated between-study variance under scenarios of genuine heterogeneity after 30 simulated combined data sets. The magnitude of an odds ratio of 1.3 under a true log-additive model became ∼1.41 under a misspecified recessive model (lying, as expected, between the heterozygote and homozygote odds ratios of 1.3 and 1.69, respectively), and the median estimated between-study variance was only 0.014 when the true value was 0.05.

### Increase in power with more combined data sets and power desert

Although power generally increased with accumulation of more data, a slow takeoff of power was seen in the presence of moderate to high between-study variance and log-additive or dominant models (Figure 1, C and F). Power even decreased after the first study and subsequently increased only slightly, if at all, up to the accumulation of 7–10 studies (reflecting, on average, a total of 7,000–10,000 cases and 7,000–10,000 controls), a region that can be characterized as a *power desert*. Once the cumulative sample size was beyond the range of the power desert, power increased substantially with the addition of new data sets. The power desert was nonexistent when there was no or little between-study heterogeneity.

The power desert may be attributed, at least in part, to the type I error properties of the DerSimonian-Laird method when few data sets are combined. By use of the *t* distribution with *k* − 1 df instead of the normal distribution to test the combined effect against the null hypothesis that the average random-effects estimate is zero, power with few data sets was entirely eroded and reached values of less than 30% even with 30 studies and correct model specification. This is because, with 30 studies (29 df), a *t* = 7.02 is required for *P* = 10^{−7}, while a standard normal distribution *z* = 5.32 would suffice. The *P* value corresponding to *z* = 7.02 would be *P* = 2 × 10^{−12}, thus making for an exceedingly demanding discovery threshold.
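The thresholds quoted above can be checked numerically. The sketch below uses only the Python standard library: the normal quantile comes from `statistics.NormalDist`, while the Student *t* tail is obtained by Simpson integration of the density followed by bisection. The helper names, the integration span, and the step count are choices made for this illustration. The truncated integral is adequate for moderately large df such as 29, but not for very small df, where the tails are heavy.

```python
import math
from statistics import NormalDist


def t_two_sided_p(t, df, steps=4000, span=40.0):
    """Two-sided tail probability of Student's t via Simpson's rule on [t, t + span]."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_const - 0.5 * (df + 1) * math.log1p(x * x / df))

    h = span / steps  # steps must be even for Simpson's rule
    total = pdf(t) + pdf(t + span)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(t + i * h)
    return 2 * total * h / 3


def t_critical(alpha, df):
    """Smallest |t| with two-sided P <= alpha, found by bisection on [0, 50]."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_two_sided_p(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


alpha = 1e-7
z_crit = NormalDist().inv_cdf(1 - alpha / 2)           # normal threshold
t_crit = t_critical(alpha, df=29)                      # 30 studies, k - 1 = 29 df
p_at_t = math.erfc(t_crit / math.sqrt(2))              # normal P value at that statistic
print(z_crit, t_crit, p_at_t)
```

The printed values can be compared with the *z* = 5.32, *t* = 7.02, and *P* = 2 × 10^{−12} figures given in the text.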

### Bias in effect and heterogeneity estimates for discovered genetic variants

For variants discovered with few data sets (i.e., underpowered conditions), there was an anticipated bias in the effect estimates (Figure 2). Bias was higher for scenarios of lower statistical power: with low effect sizes, low minor allelic frequencies, and higher between-study variability. For example, for *f* = 0.1, odds ratio = 1.1, and genuine heterogeneity (τ^{2} = 0.025) across data sets, discovered signals arising from 15 or fewer combined data sets were, on average, 142% upwardly biased (range from 114% to 310%) even if a log-additive model had been correctly specified. With many data sets accumulated, the effects derived from correctly specified models tended to become unbiased, whereas summary effects derived from misspecified models still carried the impact of the misspecification.

In addition, there was a substantial underestimation of the true between-study variance in all examined scenarios with heterogeneity. The extent of the underestimation decreased with the accumulation of more data sets. Figure 3 presents 3 indicative scenarios showing underestimation of the true variability of effects. Of note, when the true model is log additive with τ^{2} = 0.05 for θ_{aA}, both dominant and recessive misspecified models tend to converge toward apparently inflated estimates of τ^{2} in the range of 0.08. However, this seeming inflated estimation is an artifact, because τ^{2}(θ_{aa}) is then 0.20 in the true log-additive model and, thus, misspecified models continue to provide deflated heterogeneity estimates.

The underestimation of the between-study variance with few studies also contributes to the power desert phenomenon: For a true but heterogeneous association, as more studies accumulate, confidence intervals tend to become smaller, because of the larger sample size, but concurrently they tend to become wider, because the between-study variance is more properly calculated; the net effect is no increase, or even a decrease, in power. This is further compounded by the peculiarly high type I error rates of the random-effects model with normality assumptions when there are only a few data sets.

### Discovery properties based on fixed-effects calculations

When data were combined with a fixed-effects model that ignores between-study heterogeneity in the calculations, power increased rapidly with accumulating data sets, in some cases much faster than with random-effects synthesis, unless the true model was recessive and the minor allele frequency was low (*f* = 0.1). No power desert was seen in any situation. Figure 4 shows the results from a common variant with *f* = 0.4. The bias in effect sizes was of comparable magnitude for discovery based on fixed effects and for discovery based on random effects.

With a homogeneous null genetic model (no variability in the null effect in any population), all simulations yielded no false-positive results (data not shown). However, in the presence of moderate to high between-study variability with null-average effects, there was a large increase in the amount of false-positive claimed discoveries derived by fixed-effects compared with random-effects models (Table 2). False positives became more common with accumulation of more data sets. The false-positive rate was very high with log-additive and even higher with dominant models when the allelic frequencies were low (*f* = 0.1), but it was more remarkable for the recessive model when the trait-associated allele increased in frequency. For example, with τ^{2} = 0.05, a log-additive analysis with 20 data sets (20,000 cases and 20,000 controls) and *f* = 0.1 yielded a 1.18% false-positive rate at α = 10^{−7}. With increasing accumulated data, the false-positive rate with fixed-effects calculations exceeded the false-positive rate of random-effects models by over 100- or even 1,000-fold.

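Table 2. False-positive rates (%) at α = 10^{−7} under the null-average model with fixed-effects calculations, by model of analysis, minor allele frequency (*f*), and between-study variance (τ^{2})

| Cumulative No. of Data Sets | Dominant (f = 0.1, τ^{2} = 0.025) | Log Additive (f = 0.1, τ^{2} = 0.025) | Recessive (f = 0.1, τ^{2} = 0.025) | Dominant (f = 0.1, τ^{2} = 0.05) | Log Additive (f = 0.1, τ^{2} = 0.05) | Recessive (f = 0.1, τ^{2} = 0.05) | Dominant (f = 0.4, τ^{2} = 0.025) | Log Additive (f = 0.4, τ^{2} = 0.025) | Recessive (f = 0.4, τ^{2} = 0.025) | Dominant (f = 0.4, τ^{2} = 0.05) | Log Additive (f = 0.4, τ^{2} = 0.05) | Recessive (f = 0.4, τ^{2} = 0.05) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 0.1614 | 0.0826 | 0 | 1.3594 | 0.8300 | 0 | 0.1778 | 0.0520 | 0.3487 | 1.4310 | 0.5749 | 2.3932 |
| 4 | 0.1706 | 0.0876 | 0 | 1.4270 | 0.8949 | 0 | 0.1821 | 0.0507 | 0.3789 | 1.4644 | 0.5914 | 2.5262 |
| 10 | 0.1779 | 0.0962 | 0 | 1.5750 | 0.9986 | 0.0002 | 0.1795 | 0.0524 | 0.3885 | 1.4837 | 0.6175 | 2.7443 |
| 20 | 0.1979 | 0.1133 | 0 | 1.8023 | 1.1789 | 0.0001 | 0.1936 | 0.0537 | 0.4197 | 1.5305 | 0.6696 | 2.9607 |
| 30 | 0.2164 | 0.1276 | 0 | 2.0274 | 1.3598 | 0 | 0.1832 | 0.0661 | 0.4383 | 1.6034 | 0.7513 | 3.2084 |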

Results are based on 1,000,000 simulations.
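The contrast between fixed- and random-effects false-positive rates under a null-average model can be illustrated with a much simpler simulation than the genotype-level one used in this study. The sketch below works directly with summary statistics: each study's true log odds ratio is drawn around zero with variance τ^{2}, and every study is given the same within-study variance. The value 0.011 used for that variance is a rough hypothetical figure chosen for illustration, and the function names and seed are likewise assumptions. The output is not expected to reproduce the tabulated rates.

```python
import math
import random


def pooled_z(estimates, variances, random_effects):
    """z statistic of the pooled estimate (fixed effects or DerSimonian-Laird)."""
    k = len(estimates)
    weights = [1 / v for v in variances]
    sum_w = sum(weights)
    fe = sum(w * y for w, y in zip(weights, estimates)) / sum_w
    tau2 = 0.0
    if random_effects:
        q = sum(w * (y - fe) ** 2 for w, y in zip(weights, estimates))
        scale = sum_w - sum(w * w for w in weights) / sum_w
        tau2 = max(0.0, (q - (k - 1)) / scale)
    weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    return pooled * math.sqrt(sum(weights))


def false_positive_rates(k, tau2, within_var, z_crit, n_sim, seed):
    """Share of null-average meta-analyses of k studies with |z| > z_crit."""
    rng = random.Random(seed)
    sd_total = math.sqrt(tau2 + within_var)  # true effect plus sampling error
    variances = [within_var] * k
    hits_fe = hits_re = 0
    for _ in range(n_sim):
        estimates = [rng.gauss(0.0, sd_total) for _ in range(k)]
        if abs(pooled_z(estimates, variances, False)) > z_crit:
            hits_fe += 1
        if abs(pooled_z(estimates, variances, True)) > z_crit:
            hits_re += 1
    return hits_fe / n_sim, hits_re / n_sim


# z_crit = 5.327 corresponds to a two-sided alpha of about 1e-7.
fe_rate, re_rate = false_positive_rates(k=10, tau2=0.05, within_var=0.011,
                                        z_crit=5.327, n_sim=20000, seed=7)
print(fe_rate, re_rate)
```

Under these assumptions the fixed-effects rate is far above the random-effects rate, because the fixed-effects standard error ignores the between-study component that dominates the spread of the estimates.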

## DISCUSSION

Our simulations explore the discovery properties of GWA signals from cumulatively combined data sets. In the presence of unknown genetic model and concomitant potential misspecification, the power to detect modest or small association signals discovered from GWA platforms can be markedly eroded. However, we also observed several circumstances where power would remain intact or even increase despite misspecification. In all, assumption of a log-additive mode of inheritance, even if misspecified, should not incur major power loss, perhaps with the exception of recessive variants. The latter would be very difficult to identify even if correctly modeled. In some circumstances, a power desert was encountered in our simulations with random-effects models, where an increase in the number of heterogeneous data sets and cumulative sample size did not seem to increase or even seemed to reduce the power to detect true genetic effects until a minimum number of data sets had been combined (16). This suggests that prospective meta-analysis initiatives should not be discouraged if they have few or no new variants that pass genome-wide significance by both fixed and random effects when the cumulative sample size is in the range of 2,000–20,000 participants. Instead, one should pursue a further increase in sample size to escape the power desert area. Single studies with over 10,000 cases (and as many controls) are uncommon anyhow. Conversely, collaborative efforts have shown that much larger sample sizes can be amassed for studying complex traits by the coalition of many teams of investigators. Examples include the Diabetes Genetics Replication and Meta-analysis (DIAGRAM) initiative (2), the Genome-wide Investigation of Anthropometric Measures (GIANT) consortium (5), and Genetic Markers for Osteoporosis (GENOMOS)/Genetic Factors for Osteoporosis (GEFOS) (23).

Our findings also suggest that many true, successfully detected effects are likely to be inflated, even when several data sets are combined. This extends observations about the expected prevalence of winner's curse in single studies (19). A thorough knowledge and a greater awareness of the potentially biased estimates in meta-analyses of several data sets can safeguard new replication studies against lack of power and may lead to more cautious estimation of the proportion of variance explained and population attributable risk.

Besides discovery of new true associations, avoidance of false positives is also important. A worrisome scenario is the discovery of genome-wide signals of seemingly homogeneous, consistent associations when some genetic variants have null effects on average, but there is considerable between-population heterogeneity with susceptibility effects in some data sets and protective effects in others. The frequency of genuine genetic flip-flop is unknown (24). If some genetic effects are genuinely present in one direction in some populations and in the opposite direction in others, then detection of these associations is welcome, even if only one direction is discovered first. Similarly, one can only speculate on how frequently errors or biases acting in opposite directions in different data sets may cause a flip-flop equivalent. It is worrisome when such null-average variants are discovered as nonnull associations, because they tend to have seemingly highly consistent effects across data sets without observed between-study heterogeneity. At face value, they emerge as seemingly excellent, consistent, high-credibility associations, and they are more frequently suggesting susceptibility rather than protective effects. Perusal of further replication efforts may reveal between-data set heterogeneity and would eliminate almost all of these signals (at least by random-effects calculations); conversely, genuine homogeneous nonnull associations will continue to show strong signals. This suggests that even associations that have crossed genome-wide significance with several data sets accumulating 20,000 or even more participants are worthwhile to go through further replication (25, 26).

Fixed-effects calculations ignore the potential heterogeneity that may exist across data sets, and thus they increase power and avoid power deserts. Therefore, fixed effects may be preferable for the purposes of initial discovery, if the aim is simply to screen and identify as many of the true variants as possible. However, with fixed effects, the rate of false positives increases substantially as more data accumulate, even up to 100- or 1,000-fold or more compared with random-effects calculations, when effects are truly null on average, but there is heterogeneity with study-specific effects in both directions. Therefore, if there is any concern that bias or errors may cause even small deviations in either direction, fixed-effects calculations may yield spurious discoveries. Associations that pass desired significance thresholds with fixed-effects calculations, but not with random-effects calculations, may require further replication. Given that the costs of genotyping technologies are decreasing and there is a need to maximize the number of discovered new variants, investigators may prioritize a fixed-effects model to retain most true association signals at the expense of replicating several false-positive results. On the other hand, if the emphasis is on building predictive models with validated markers, incorporating variants that produce heterogeneous results and reach genome-wide significance only by a fixed-effects model may be misleading.

A caveat with both fixed- and random-effects models, as applied here, is the assumption of normality, as well as the reliance on an approximation regardless of the number of studies being combined. For fixed effects, these assumptions are easier to check (although this is rarely done); for random effects, evaluation of the normality assumption for the distribution of population effects is usually impossible. Our results suggest that between-study heterogeneity may substantially inflate the type I error rate for the DerSimonian-Laird test, particularly at stringent α levels. Alternative permutation approaches to maintain a more adequate rate of false-positive signals exist (22), but the computational burden poses a clear impediment in the genome-wide setting, particularly when very low levels of statistical significance are pursued, since the number of required permutations can become very time-consuming or even prohibitive. Methods based on heavy-tail distributions are also available but have not been widely used (27). We found that using the *t* distribution with *k* − 1 df instead of the normal approximation (22) corrects the type I error but drastically erodes the power for discovery. The loss of power has been shown to be small for traditional applications of meta-analysis where α = 0.05 (22), because, for example, for 30 studies *t* = 2.045 is very close to *z* = 1.96. However, for genome-wide–level discoveries where α = 10^{−7}, the deviation between *t* and *z* (7.02 vs. 5.32) has far greater consequences. Finally, there are iterative approaches for the estimation of the between-study variance (28), but the resulting estimates are usually substantially uncertain when few data sets are available.
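The critical values quoted in the preceding paragraph can be checked approximately with the following sketch, which uses only the Python standard library. The helper names (`t_pdf`, `t_upper_tail`, `t_critical`, `z_critical`), the Simpson-rule integration, and the bisection bracket are illustrative assumptions and are not part of the original analysis; the bracket is adequate for 29 df but would need widening for very few studies.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, steps=4000):
    """P(T > x) for x >= 0, from Simpson's rule on the density over [0, x]."""
    if x <= 0:
        return 0.5
    h = x / steps                                  # steps must be even
    terms = [t_pdf(0.0, df), t_pdf(x, df)]
    terms += [(4 if i % 2 else 2) * t_pdf(i * h, df) for i in range(1, steps)]
    return 0.5 - math.fsum(terms) * h / 3

def t_critical(alpha, df, upper=20.0):
    """Two-sided critical value by bisection on [0, upper] (assumed bracket)."""
    lo, hi = 0.0, upper
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > alpha / 2:
            lo = mid                               # tail still too heavy: move right
        else:
            hi = mid
    return (lo + hi) / 2

def z_critical(alpha):
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)

k = 30                                             # number of studies; df = k - 1
for alpha in (0.05, 1e-7):
    z = z_critical(alpha)
    t = t_critical(alpha, k - 1)
    print(f"alpha={alpha:g}: z={z:.3f}, t({k - 1} df)={t:.3f}, ratio={t / z:.2f}")
```

Under these assumptions, the ratio of the two thresholds is close to 1 at α = 0.05 but exceeds 1.3 at α = 10^{−7}, which is the source of the power loss described above.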

Some additional caveats should be discussed. First, we have modeled the accumulation of several studies of 1,000–3,000 participants each. In reality, studies may have considerably unequal sample sizes. One can use our simulation engine to probe scenarios of specific sample sizes being combined. However, the basic concepts should be robust in the absence of extreme situations. Second, instead of combining several studies, a very large study may be performed upfront for discovery purposes, and newly designed large studies are welcome (29). Regardless, the cumulative evidence from the combination of all available data sets (both existing and newly designed) should give the most comprehensive picture of the total evidence. Our simulations highlight the caveats that may arise from the play of between-study heterogeneity. Thus, heterogeneity from errors, biases, and other correctable sources should ideally be minimized and, if possible, eliminated (1). Third, we have not investigated the discovery properties for variants with minor allele frequency below 10%, but for such variants even extremely large sample sizes may have low power. We need more empirical evidence on how much of the genetic architecture we can unearth by pursuing common variants with minor allele frequencies exceeding 10% (29). Fourth, we have addressed several typical genetic models of inheritance, but others are also possible, for example, a co-dominant model with linear additive inheritance, heterosis, and associations that fail to fit a classic model.

Our simulations can be used in assessing the incremental power, change in false-positive rate, and improvement in effect size and heterogeneity estimates that can be attained with the accumulation of a specific amount of additional data. Discovery and replication are ongoing, open-ended processes. Extensive, long-reach replication may be pivotal in discriminating truly consistent from seemingly consistent associations (30). Given the difficulties of the simple identification of each variant, confident specification of the genetic model for each emerging variant may be too demanding a goal. This may need to await better characterization of the variant and its linked local region (31) or even functional data (32). Moreover, inferences about exact effect sizes and heterogeneity estimates should be cautious until large cumulative sample sizes have been assembled. Use of this information for risk prediction across different populations may be uncertain, and wide generalizations may be premature (33). However, the most positive aspect of our simulations is that, for many settings, power typically can become reasonable and estimates can be accurate, if large enough sample sizes are assembled, either in existing studies or new, explicitly designed cohorts (34, 35).

### Abbreviations

- GWA: genome-wide association
- OR: odds ratio

Author affiliations: Laboratory of Genetics and Molecular Cardiology, Heart Institute (InCor), University of São Paulo Medical School, São Paulo, Brazil (Tiago V. Pereira); Department of Biochemistry and Molecular Biology, Federal University of São Paulo, São Paulo, Brazil (Tiago V. Pereira); Clinical and Molecular Epidemiology Unit, Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece (Tiago V. Pereira, Nikolaos A. Patsopoulos, Georgia Salanti, John P. A. Ioannidis); Biomedical Research Institute, Foundation for Research and Technology—Hellas, Ioannina, Greece (John P. A. Ioannidis); and Tufts Clinical and Translational Science Institute and Center for Genetic Epidemiology and Modeling, Institute for Clinical Research and Health Policy Studies, Tufts Medical Center and Department of Medicine, Tufts University School of Medicine, Boston, Massachusetts (John P. A. Ioannidis).

Tiago Pereira holds a fellowship from the Coordenação de Aperfeiçoamento Pessoal de Nível Superior (CAPES, Brazilian Ministry of Education) and was supported by the Wood-Whelan Research Fellowship (International Union of Biochemistry and Molecular Biology) to be a visiting researcher at the University of Ioannina for this project. Nikolaos Patsopoulos is funded by a PENED grant from the European Union and the General Secretariat for Research and Technology, Greece. Scientific support for this project was provided through the Tufts Clinical and Translational Science Institute under funding from the National Center for Research Resources (UL1 RR025752), National Institutes of Health.

Points of view or opinions in this paper are those of the authors and do not necessarily represent the official position or policies of the Tufts Clinical and Translational Science Institute.

Conflict of interest: none declared.

## References

*CD40* and other loci confer risk of rheumatoid arthritis

*LRP5* and *LRP6* variants and osteoporosis

## Author notes

**Editor's note**: This article also appears on the website of the Human Genome Epidemiology Network (http://www.cdc.gov/genomics/hugenet/default.htm).