-
PDF
- Split View
-
Views
-
Cite
Cite
Buhm Han, Dat Duong, Jae Hoon Sul, Paul I. W. de Bakker, Eleazar Eskin, Soumya Raychaudhuri, A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping, Human Molecular Genetics, Volume 25, Issue 9, 1 May 2016, Pages 1857–1866, https://doi.org/10.1093/hmg/ddw049
- Share Icon Share
Abstract
Meta-analysis strategies have become critical to augment power of genome-wide association studies (GWAS). To reduce genotyping or sequencing cost, many studies today utilize shared controls, and these individuals can inadvertently overlap among multiple studies. If these overlapping individuals are not taken into account in meta-analysis, they can induce spurious associations. In this article, we propose a general framework for adjusting association statistics to account for overlapping subjects within a meta-analysis. The key idea of our method is to transform the covariance structure of the data, so it can be used in downstream analyses. As a result, the strategy is very flexible and allows a wide range of meta-analysis methods, such as the random effects model, to account for overlapping subjects. Using simulations and real datasets, we demonstrate that our method has utility in meta-analyses of GWAS, as well as in a multi-tissue mouse expression quantitative trait loci (eQTL) study where our method increases the number of discovered eQTL by up to 19% compared with existing methods.
Introduction
Meta-analysis of genome-wide association studies (GWAS) has become crucial to scale up genetic discovery (1,2). Using meta-analysis, investigators are able to combine many studies to dramatically increase sample size and thereby to increase statistical power to detect phenotypically associated variants. In recent years, large-scale meta-analyses have become essential to increase the number of known associated loci in many human diseases (3,4). Indeed, almost all GWAS publications today employ meta-analysis methodologies (5).
A range of meta-analysis methods are now commonly used in GWAS (6). The fixed effects (FE) model, which assumes that effect sizes are fixed across studies, is powerful when its assumption holds true (7,8). The random effects (RE) model assumes that the effect sizes can be different between studies, a phenomenon called heterogeneity (9). We recently proposed a RE model that is more powerful than FE if the data are heterogeneous (10). Additional categories of meta-analysis methods include the P-value-based approaches (11,12); the subset approaches, which assume that the effects are present or absent in studies (13,14); and the Bayesian approaches (15–17). The majority of these methods assume that component studies are independent and that the individuals collected in one study are separate and unrelated from those in other studies.
Many studies today utilize shared controls to reduce genotyping or sequencing cost. For example, the 1958 British Birth Cohort (18) is being widely used in recent GWAS (19–23). As a result, there can be inadvertent overlap between individuals across multiple studies. The resulting dependencies can induce spurious associations if the overlapping subjects are not explicitly removed prior to meta-analysis or taken into account during meta-analysis. A simple solution would be to manually split the overlapping subjects into separate studies, but this can be suboptimal in terms of statistical power (24). Moreover, splitting genotype data is often not possible in practice if the primary genotype data are not shared; this is a common situation in current GWAS due to the sensitivities of sharing individual-level data.
Previous studies addressed this challenge, but only focusing on specific meta-analysis models. Lin and Sullivan (24) proposed a FE meta-analysis approach that takes into account overlapping subjects. Zaykin and Kozbur (12) and Bhattacharjee et al. (13) extended this method to the P-value-based approach and subset approach, respectively. None of these approaches are able to account for dependency structure in other meta-analysis frameworks such as the RE models. Application of the RE models to meta-analyses recently became crucial because of its potential to go beyond standard study designs. Investigators are now increasingly using meta-analysis to combine data from multiple related diseases in order to identify shared pathways (25,26), to combine data from multiple tissues to identify shared expression quantitative trait loci (eQTL) (27,28) and to combine neuroimaging data of different psychiatric diseases to identify shared brain features (29). In these new study designs, the use of RE models is ideal because the effect sizes are not necessarily the same between datasets. However, if dependency structure exists in the data, the current RE models cannot be applied due to the independence assumption.
Another practical challenge in using the current solutions to dependency structure is that one cannot use widely used pipelines for GWAS meta-analysis. Many analysis pipelines utilize standard software packages implementing meta-analysis models (2,30). These packages have advantages in solving practical issues in data cleaning, quality control and imputation (2). However, many of these do not support overlapping subjects and therefore cannot be used if subjects are shared between studies.
In this article, we propose a flexible and general framework for meta-analyzing dependent studies. The philosophy of our framework is unique, in that our framework does not compete with existing methods; it enables them to account for overlapping subjects. Given dependent studies, our framework transforms the covariance structure of the data, so it can be used in methods that strictly assume independence between studies. Specifically, our method decouples dependent studies into independent studies, adjusting association statistics to account for uncertainties in dependent studies. As a result, our approach enables general meta-analysis methods, including the FE and RE models, to account for overlapping subjects. Existing pipelines implementing these models can be reused for analyzing dependent studies, as long as our framework is applied as a front end of the analysis procedure.
We applied our approach to the eQTL data of Sul et al. (27), which involves data from multiple mouse tissues. Using our framework to combine information from multiple tissues and to account for tissue heterogeneity with the RE model, we increased the number of discovered eQTL by up to 19% compared with existing methods. We also performed a meta-analysis of three autoimmune diseases with shared controls using the Wellcome Trust Case Control Consortium (WTCCC) data (18) to demonstrate that our framework improved statistical significance of known candidate loci associated to multiple autoimmune diseases.
Results
Overview of the method
We propose a general approach that accounts for the dependency structure between studies in a meta-analysis. As we review in the Materials and Methods section, many traditional as well as recently proposed meta-analysis methods cannot account for overlapping subjects (Supplementary Material, Table S1). Our goal was to develop a framework that could be generally applied to existing approaches, so that methods not yet accounting for dependency could, and so that existing software packages could be reused. To this end, we can transform the covariance structure of the data by decoupling dependent studies into independent studies. Specifically, to account for uncertainties between studies that are dependent, we increase the standard errors in the decoupled studies. In Figure 1, we have three studies, A, B and C, whose statistics are correlated and have unit variances. Our approach decouples these studies into independent studies that have the same information with respect to the summary statistic. The variances of A, B and C increase to 1.39, 2.52 and 1.24, respectively. The variance of Study B increases the most drastically to 2.52 because B was highly correlated to A and C, with correlations of 0.5 and 0.3, respectively. High correlations to other studies imply that this study has the smallest information content as an independent study. We can then use the decoupled studies in the downstream meta-analysis approaches.

A simple example of our decoupling approach. Ω and are the covariance matrices of the statistics of three studies A, B and C before and after decoupling, respectively. The thickness of the edges denotes the amount of correlation between the studies. After decoupling, the size of the nodes reflects the information that the studies contain in terms of the inverse variance.
False-positive rate
One of the primary goals of decoupling is to prevent spurious associations that can occur if we ignore overlapping subjects in a meta-analysis. Thus, we performed simulations to compare the false-positive rates of different methods: (a) a naive method that does not account for overlapping subjects, (b) a splitting method that splits shared controls into individual studies and (c) a decoupling method that systematically accounts for overlapping subjects. We simulated genotypes assuming a SNP with the minor allele frequency (MAF) P = 0.3. We constructed five studies whose numbers of cases are 1000, 2000, 3000, 4000 and 5000. We assumed that these studies share 10 000 controls. In the splitting method, we equally split controls, allotting 2000 controls per study. Under the null hypothesis of no association of the SNP to the trait, we generated 1 million such study sets. We applied the FE model to all three methods.
Table 1 shows that both decoupling and splitting correctly control the false-positive rate. However, the naive method that ignores overlapping subjects does not control the false-positive rate correctly. We calculated the ratios between the estimated false-positive rates and the significance threshold α. The ratios for decoupling and splitting are consistently close to 1.0 (mean ratio over different thresholds, 0.98 for decoupling and 1.08 for splitting). However, the ratio of the naive method is very large (mean ratio of 6.65). In fact, the ratio becomes larger as the threshold becomes more significant, the maximum being 13.56 at threshold α = 5 × 10−5. This phenomenon emphasizes that ignoring overlapping subjects will incur a large number of false positives at the extreme threshold of 5 × 10−8 that is typically used in GWAS. We also applied the RE model of Han and Eskin (10) to the three methods. Decoupling conservatively controlled false-positive rate (mean ratio of 0.70), and the naive method again had an inflated false-positive rate (mean ratio of 5.40, see Supplementary Material, Table S2). Additionally, to explore a wider range of realistic scenarios, instead of fixing sample sizes of five studies, we randomly sampled sample sizes for simulations. We sampled case sample size of each study from N(3000, 10002), with lower bound 500, and assumed study-specific control samples whose size is sampled from N(3000, 10002). We assumed 10 000 shared controls and similarly repeated 1 million null simulations. Supplementary Material, Table S3 shows that the results are largely unchanged. Decoupling and splitting controlled false-positive rate well. Decoupling RE showed similar conservative tendency (mean ratio of 0.72).
Threshold α | 0.05 | 0.01 | 0.005 | 0.001 | 5 × 10−4 | 1 × 10−4 | 5 × 10−5 |
Decoupling | 0.0496 (0.99) | 0.00997 (1.00) | 0.00499 (1.00) | 0.000992 (0.99) | 0.0005 (1.00) | 8.7 × 10−5 (0.87) | 4.9 × 10−5 (0.98) |
Splitting | 0.0499 (1.00) | 0.01 (1.00) | 0.00501 (1.00) | 0.00104 (1.04) | 0.000539 (1.08) | 0.000123 (1.23) | 5.9 × 10−5 (1.18) |
Naive | 0.1 (2.00) | 0.0308 (3.08) | 0.0186 (3.72) | 0.0059 (5.90) | 0.00354 (7.09) | 0.00112 (11.20) | 0.000678 (13.56) |
Threshold α | 0.05 | 0.01 | 0.005 | 0.001 | 5 × 10−4 | 1 × 10−4 | 5 × 10−5 |
Decoupling | 0.0496 (0.99) | 0.00997 (1.00) | 0.00499 (1.00) | 0.000992 (0.99) | 0.0005 (1.00) | 8.7 × 10−5 (0.87) | 4.9 × 10−5 (0.98) |
Splitting | 0.0499 (1.00) | 0.01 (1.00) | 0.00501 (1.00) | 0.00104 (1.04) | 0.000539 (1.08) | 0.000123 (1.23) | 5.9 × 10−5 (1.18) |
Naive | 0.1 (2.00) | 0.0308 (3.08) | 0.0186 (3.72) | 0.0059 (5.90) | 0.00354 (7.09) | 0.00112 (11.20) | 0.000678 (13.56) |
Naive refers to a method that ignores overlapping subjects. Splitting refers to a strategy that splits shared controls into individual studies. All three methods used the FE model. The numbers in parentheses are the ratios of estimated false-positive rates and the threshold α. We expect these numbers to be close to 1.0.
Threshold α | 0.05 | 0.01 | 0.005 | 0.001 | 5 × 10−4 | 1 × 10−4 | 5 × 10−5 |
Decoupling | 0.0496 (0.99) | 0.00997 (1.00) | 0.00499 (1.00) | 0.000992 (0.99) | 0.0005 (1.00) | 8.7 × 10−5 (0.87) | 4.9 × 10−5 (0.98) |
Splitting | 0.0499 (1.00) | 0.01 (1.00) | 0.00501 (1.00) | 0.00104 (1.04) | 0.000539 (1.08) | 0.000123 (1.23) | 5.9 × 10−5 (1.18) |
Naive | 0.1 (2.00) | 0.0308 (3.08) | 0.0186 (3.72) | 0.0059 (5.90) | 0.00354 (7.09) | 0.00112 (11.20) | 0.000678 (13.56) |
Threshold α | 0.05 | 0.01 | 0.005 | 0.001 | 5 × 10−4 | 1 × 10−4 | 5 × 10−5 |
Decoupling | 0.0496 (0.99) | 0.00997 (1.00) | 0.00499 (1.00) | 0.000992 (0.99) | 0.0005 (1.00) | 8.7 × 10−5 (0.87) | 4.9 × 10−5 (0.98) |
Splitting | 0.0499 (1.00) | 0.01 (1.00) | 0.00501 (1.00) | 0.00104 (1.04) | 0.000539 (1.08) | 0.000123 (1.23) | 5.9 × 10−5 (1.18) |
Naive | 0.1 (2.00) | 0.0308 (3.08) | 0.0186 (3.72) | 0.0059 (5.90) | 0.00354 (7.09) | 0.00112 (11.20) | 0.000678 (13.56) |
Naive refers to a method that ignores overlapping subjects. Splitting refers to a strategy that splits shared controls into individual studies. All three methods used the FE model. The numbers in parentheses are the ratios of estimated false-positive rates and the threshold α. We expect these numbers to be close to 1.0.
Power
Based on this frequency, we generated 10 000 simulation sets, each consisting of 5 case/control studies. In each set, we calculate the FE model P-values for both decoupling and splitting. We approximated power as the proportion of the sets in which the P-value exceeded the threshold of 5 × 10−8. We gradually increased γ and evaluated the power of the decoupling and splitting methods. Figure 2 shows that decoupling consistently outperforms splitting at all the relative risks we tested. For example, at relative risk 1.13, the power of decoupling is 78.2%, whereas the power of splitting is 70.4%. This result is in line with Lin and Sullivan's (24) observation that their approach (which is equivalent to decoupling applied to FE) outperforms splitting, unless controls are split into studies proportionally to case sample sizes, in which situation both methods will have similar power.

Power of different methods. Splitting refers to a strategy that splits shared controls into individual studies.
One advantage of decoupling is that it allows a wide range of meta-analysis methods to account for overlapping subjects, including the RE model. Using simulations, we demonstrated that we can gain power by using decoupling with the RE model in situations where effect sizes are heterogeneous between studies. Assuming the same simulation frameworks as in the previous power simulations, we additionally assumed that only two out of the five studies have genetic effects with relative risk 1.2. We assumed that the two studies having the largest case sample sizes (4000 and 5000) have effects. Then, after decoupling, we applied both the FE model and Han and Eskin's (10) RE model. We also used the splitting method and applied FE and RE. Figure 3 shows that in this situation, RE gains power over FE. When applied after decoupling, the power of RE is 86%, whereas the power of FE is 61%. Thus, we gained 25% more power by decoupling, because decoupling allowed the application of RE for dependent studies. Splitting has suboptimal power in this simulation as well. When applied after splitting, the power of RE and FE is only 51 and 14%, respectively. Note that the power difference between decoupling FE and splitting FE is even more prominent in this simulation than in Figure 2, because of our design that assumes effects in the two studies having the largest case sample sizes. Since we equally split controls in our simulation, which is equivalent to down-weighting studies having large case sizes, the power drop of splitting compared with decoupling is dramatic in this design.

Power of different methods when heterogeneity exists. We assumed that only two out of five studies had an effect size of relative risk 1.2. Splitting refers to a strategy that splits shared controls into individual studies. FE, fixed effects model. RE, random effects model of Han and Eskin (10).
Cross-disease analysis using the WTCCC data
One application of the decoupling approach is cross-disease analysis, in which controls are shared between diseases. Because it is reasonable to assume that effect size varies between different diseases, it is ideal to use the RE model. Thus, in this design, decoupling can help because it allows the application of RE models when studies share controls. As an example of cross-diseases analysis, we applied our decoupling approach to the WTCCC data. The WTCCC has performed GWAS on seven diseases, including three autoimmune diseases (Crohn's disease, rheumatoid arthritis and type 1 diabetes; CD, RA and T1D in short), using a set of shared controls (18). We performed meta-analysis on the three autoimmune diseases to identify shared loci between them.
We compared three methods: the naive method that ignores the fact that the studies share controls, the splitting method that splits controls to each disease and the decoupling method. For all three methods, we applied both the FE model and the Han and Eskin's (10) RE model. In the splitting method, we assumed that each individual is assigned to one of the three diseases with equal probability. We considered 397 450 SNPs that passed quality control for all three diseases in the original publication (18) and whose MAF are greater than 1%.
Supplementary Material, Figure S1 shows that the qq-plot is inflated for the naive method, which ignores overlapping subjects, for both FE and RE. The genomic control factors (31) were and for FE and RE, respectively. In contrast, the splitting and decoupling methods did not have such inflation. The genomic control factors were and for splitting, and and for decoupling.
We then examined the P-values of three non-MHC loci that were originally reported in the WTCCC analyses as candidates for shared loci between autoimmune diseases. Both splitting and decoupling gave significant P-values at those loci (P< 1.2 × 10−7, Bonferroni corrected for 397 450 tests, see Table 2). However, decoupling gave much more significant P-values than splitting did. For example, at PTPN22, the decoupling RE P-value (1.00 × 10−29) is about seven orders of magnitude more significant than the splitting RE P-value (1.88 × 10−22). At these loci, RE gave more significant P-values than FE did for both splitting and decoupling (Table 2 and Supplementary Material, Fig. S2A). This is due to the effect size heterogeneity at these loci between different diseases.
Association results of a meta-analysis combining CD, RA and T1D from the WTCCC data
Gene . | RSID . | Chr. . | Position . | Meta-analysis P-values . | |||
---|---|---|---|---|---|---|---|
Splitting FE . | Splitting RE . | Decoupling FE . | Decoupling RE . | ||||
PTPN22 | rs6679677 | 1 | 114 015 850 | 1.45 × 10−13 | 1.88 × 10−22 | 4.48 × 10−23 | 1.00 × 10−29 |
SH2B3 | rs17696736 | 12 | 110 949 538 | 1.93 × 10−09 | 1.18 × 10−09 | 7.72 × 10−10 | 3.40 × 10−10 |
PTPN2 | rs2542151 | 18 | 12 769 947 | 1.14 × 10−07 | 9.03 × 10−08 | 3.74 × 10−08 | 5.16 × 10−08 |
Gene . | RSID . | Chr. . | Position . | Meta-analysis P-values . | |||
---|---|---|---|---|---|---|---|
Splitting FE . | Splitting RE . | Decoupling FE . | Decoupling RE . | ||||
PTPN22 | rs6679677 | 1 | 114 015 850 | 1.45 × 10−13 | 1.88 × 10−22 | 4.48 × 10−23 | 1.00 × 10−29 |
SH2B3 | rs17696736 | 12 | 110 949 538 | 1.93 × 10−09 | 1.18 × 10−09 | 7.72 × 10−10 | 3.40 × 10−10 |
PTPN2 | rs2542151 | 18 | 12 769 947 | 1.14 × 10−07 | 9.03 × 10−08 | 3.74 × 10−08 | 5.16 × 10−08 |
The three non-MHC loci reported in the WTCCC combined analysis (18) are presented. The candidate gene for each locus was obtained from the ImmunoBase website (www.immunobase.org).
Association results of a meta-analysis combining CD, RA and T1D from the WTCCC data
Gene . | RSID . | Chr. . | Position . | Meta-analysis P-values . | |||
---|---|---|---|---|---|---|---|
Splitting FE . | Splitting RE . | Decoupling FE . | Decoupling RE . | ||||
PTPN22 | rs6679677 | 1 | 114 015 850 | 1.45 × 10−13 | 1.88 × 10−22 | 4.48 × 10−23 | 1.00 × 10−29 |
SH2B3 | rs17696736 | 12 | 110 949 538 | 1.93 × 10−09 | 1.18 × 10−09 | 7.72 × 10−10 | 3.40 × 10−10 |
PTPN2 | rs2542151 | 18 | 12 769 947 | 1.14 × 10−07 | 9.03 × 10−08 | 3.74 × 10−08 | 5.16 × 10−08 |
Gene . | RSID . | Chr. . | Position . | Meta-analysis P-values . | |||
---|---|---|---|---|---|---|---|
Splitting FE . | Splitting RE . | Decoupling FE . | Decoupling RE . | ||||
PTPN22 | rs6679677 | 1 | 114 015 850 | 1.45 × 10−13 | 1.88 × 10−22 | 4.48 × 10−23 | 1.00 × 10−29 |
SH2B3 | rs17696736 | 12 | 110 949 538 | 1.93 × 10−09 | 1.18 × 10−09 | 7.72 × 10−10 | 3.40 × 10−10 |
PTPN2 | rs2542151 | 18 | 12 769 947 | 1.14 × 10−07 | 9.03 × 10−08 | 3.74 × 10−08 | 5.16 × 10−08 |
The three non-MHC loci reported in the WTCCC combined analysis (18) are presented. The candidate gene for each locus was obtained from the ImmunoBase website (www.immunobase.org).
To test the robustness of methods against noise, we added the other four diseases as noise into the meta-analysis. Supplementary Material, Figure S2B shows that when we meta-analyze all seven diseases, the absolute magnitudes of the significance of PTPN22 are reduced, but it is still significant for both the decoupling FE and the decoupling RE. Moreover, the relative significance gain of decoupling RE compared with decoupling FE is still largely pronounced (PTPN22: and ).
Multi-tissue eQTL analysis
Recently, Sul et al. (27) applied our decoupling approach to multi-tissue eQTL analysis. Their goal was to combine information from multiple tissues to increase power for eQTL identification. They proposed a linear mixed model that accounts for multiple tissues being collected from the same individuals. However, using this approach, the resulting effect size estimates that are to be combined become correlated, which prevents the application of the standard meta-analysis method. Moreover, it is reasonable to assume effect size heterogeneity between tissues. Thus, decoupling with the RE model may assist in analysis. In their study, Sul et al. (27) compared the performance of decoupling FE, decoupling RE and the tissue-by-tissue approach (TBT) in terms of the number of eQTL identified. In TBT, they call an eQTL significant if any of the q-values in the set of 4 or 10 tissues is smaller than the false discovery rate (FDR) threshold. They demonstrated that applying decoupling with FE found a much larger number of eQTL in 4 or 10 mouse tissues than applying TBT, and applying decoupling with RE found even more at the same FDR level of 5%.
We revisited this dataset and examined whether the relative performances between methods remained unchanged if we changed the FDR threshold. See Sul et al. (27) for a detailed description of the dataset. When we varied the FDR threshold down to 0.0001 and up to 0.2, the relative performances of the methods did not change (Fig. 4). For example, at the threshold of FDR = 0.0001, TBT, decoupling FE and decoupling RE discovered 61, 718 and 780 eQTL, respectively, in 10 tissues. At the threshold of FDR = 0.2, TBT, decoupling FE and decoupling RE discovered 3870, 4914 and 5855 eQTL, respectively, in 10 tissues. Thus, decoupling RE improved the number of discovered eQTL by up to 19% compared with the best competing method (decoupling FE). These observations underscore that in multi-tissue eQTL analysis, it is important to take into account both the between-tissue heterogeneity and the between-tissue correlations, and that the decoupling RE approach can handle both of them effectively.

The number of eQTL identified by different methods in the 4 or 10 mouse tissue dataset of Sul et al. (27). We applied decoupling to account for the correlation of estimates between tissues caused by within-individual correlations between tissues. TBT, tissue-by-tissue approach that examines each tissue separately and takes the minimum q-value.
Simulations of decouplable matrices
We define the covariance matrix of the studies in a meta-analysis as decouplable if the decoupled variances are non-negative. In the Materials and Methods section, we analytically demonstrate that when the number of studies is two and the two studies share controls, the covariance matrix is always decouplable. To demonstrate that the covariance matrix is highly likely to be decouplable for more than two studies sharing controls, we performed extensive simulations. We assumed a situation where 10 studies share controls. We assumed the same MAF of 0.3 and odds ratio of 1.2 for all studies. Then, to simulate differences in the sample sizes between studies, we made the number of cases in five studies large and the number of cases in the other five studies small. Specifically, the number of cases of study i was for five studies and for the other five studies. We then varied from 1000 to 20 000 and δ from 0 to 4500 when μ was fixed at 5000. In this simulation, we observed that for all configurations we tested, the matrices were decouplable. Supplementary Material, Figure S3 shows that for all points, the percentage of studies that have negative variance is consistently zero in this simulation. This empirical result demonstrates that in most practical and common situations where studies share controls, the matrices are highly likely to be decouplable.
Discussion
We proposed a general framework for dealing with dependency structure in a meta-analysis. Our approach is philosophically different from other approaches, in that our method transforms the covariance structure of the data so that it can be used in other approaches that assume independence. Therefore, our approach flexibly allows many meta-analysis methods, such as the RE model, to account for dependency between studies and overlapping subjects. The simulations and applications to the WTCCC data support the utility of our approach. Moreover, our method allows the use of standard software packages designed for meta-analysis of GWAS, even when studies share subjects. Our method can be used beyond the standard meta-analysis designs, such as for multi-tissue eQTL analysis.
In our method, the covariance matrix can be non-decouplable in theory (the decoupled variances can be negative). However, as we analytically and empirically showed, for the common and typical situations where controls are shared, the matrices are likely decouplable. However, a covariance matrix that is not decouplable is certainly possible because of numerical instabilities when estimating the variances of effect size estimates. For such situations where a matrix is not decouplable, we can use a heuristic of dropping the least informative (smallest) studies until the variances become positive. If the test turns out to be moderately significant, we can revisit the data and apply the standard Lin and Sullivan (24) approach.
Our method builds upon the Lin and Sullivan (24) approach. Our method is optimal in the same sense that the Lin and Sullivan (24) approach is optimal for the FE model. Thus, the optimality with respect to power is guaranteed only under the FE model. Our false-positive rate simulations demonstrated that our approach has some conservative tendencies under the RE model. However, we showed that the decoupling RE approach has much higher power than the decoupling FE approach (which is equivalent to the Lin and Sullivan (24) approach) in both simulations and in real data, showing that decoupling can be a flexible and useful tool despite the fact that it might not be optimal for all methods. Optimal solutions to account for dependency structure in each meta-analysis method will be an interesting and important topic for further research.
In this article, we demonstrated the false-positive rate of different methods such as decoupling RE only by simulations. Obviously, the value of simulations is limited to the conditions chosen. Nevertheless, we explored a wide range of situations by using randomly chosen sample sizes as well as fixed sample sizes (Supplementary Material, Table S3) so that our simulations can represent realistic scenarios. In future studies, we expect that analytical characteristics of decoupling can be further investigated, which may establish theoretical connection of decoupling to other meta-analysis methods than FE.
We note that one should be careful in interpreting data based on decoupled studies. For example, the heterogeneity testing such as Cochran's Q test (1) is highly conservative when using decoupled studies and may not detect true heterogeneity. Unfortunately, there is no good alternative to Cochran's Q test and the estimate (32) that can take into account correlated data. Developing such methods will also be an interesting future research area.
Materials and Methods
Review of meta-analysis methods
To better discuss our proposed analysis method, we first briefly describe several existing meta-analysis methods.
Fixed effects model
Random effects model
The traditional RE model cannot account for overlapping subjects.
New random effects model of Han and Eskin
P-value-based approaches
Subset approaches
Subset approaches are similar to the FE model, but they differ in that they assume the effects can exist in only a subset of the studies. Han and Eskin (10) computed a statistic called the ‘m-value’, a posterior probability that the effect is present in a study. They incorporated m-values into the weighted z-score FE approach to upweight studies with high m-values. The P-value is assessed using importance sampling. This method cannot account for overlapping subjects. Bhattacharjee et al. (13) proposed an approach that computes FE statistics using all possible subsets of studies in a meta-analysis and uses the maximum statistic. The method expedites the enumeration of all possible subsets of N studies by using a novel statistical technique. This method can account for overlapping subjects.
Bayesian approaches
Morris (15) proposed a Bayesian approach optimized for trans-ethnic meta-analysis. This method utilizes the Markov Chain Monte Carlo (MCMC) procedure to navigate through possible disease models. In his MCMC sampling procedure, closely related populations are given higher chance to have similar effect sizes, which increases the statistical power to detect heterogeneity caused by the population spectrum. This method cannot account for overlapping subjects. Wen and Stephen (16) proposed a new method that takes the heterogeneity in the data into account by using a hierarchical model in the Bayesian framework. Wen (17) recently extended this method to a model selection approach in complex linear systems, which can account for overlapping subjects.
Definition of dependency
In this article, we often describe two studies in a meta-analysis as dependent, which needs to be defined. We define two studies as dependent if the statistics from those studies are correlated. The statistics can be correlated in a number of reasons, the most common one being that the control subjects overlap, resulting in positive correlations.
The Lin and Sullivan approach
Lin and Sullivan (24) proposed a systematic approach for dealing with overlapping subjects in the FE model. Although only applicable to the FE model, this approach was an innovative breakthrough that provided a foundation of methods that account for dependency in meta-analysis. Recently, Zaykin and Kozbur (12) and Bhattacharjee et al. (13) extended this approach to the P-value-based approach and the subset approach, respectively. Our proposed approach also builds upon this method.
Decoupling approach
Decoupling
We propose the following decoupling transformation, which keeps the same information as the Lin and Sullivan (24) approach, with respect to the summary effect size. Suppose that we are given the effect sizes x, the standard errors s and the correlation matrix C as computed by formula (4). Computing C using formula (4) is feasible in many situations since we typically know the exact numbers of shared cases and controls. If this is not the case, an alternative approach would be empirically approximating C by calculating correlations using genome-wide z-scores of the studies. This empirical approximation requires an assumption that the majority of the SNPs in the genome follow the null hypothesis and do not have an effect, which may not be true if the trait is polygenic (36).
Given x, s and C, we propose the following procedure: In the Supplementary Materials, we present a simple R code that performs this procedure.
- Keep the original x:
- Compute the covariance matrix of x:where denotes a diagonal matrix whose diagonals are s.
- Compute the decoupled covariance matrix:
- Update the standard errors:where the brackets [ ] denote the index of an element in a vector or matrix.
Use and in the downstream meta-analysis.
Relation to the Lin and Sullivan approach
Our approach has a relationship to the Lin and Sullivan (24) approach; if the decoupled studies are used in the FE model, the results are equivalent. Thus, one can use existing software packages, such as METAL (30) or MANTEL (2), to obtain optimal results after the data transformation. Below is a brief proof of this relationship.
Decouplable matrix
We define a decouplable covariance matrix as a covariance matrix whose decoupled variances are non-negative. In theory, not every matrix Ω is decouplable. However, in practice, in typical situations where studies share all or a subset of control samples, the covariance matrix is highly likely to be decouplable (see the Results section).
Thus, inequality (9) holds, and therefore, matrix Ω is always decouplable.
Meta-analysis in multi-tissue eQTL analysis
where Y is an matrix denoting expression levels of N individuals in T tissues. 1 is an NT × T matrix denoting the intercepts for T tissues. The first column is the intercept for the first tissue; the first N rows are ones, and the next are zeros. Other columns are similarly defined. α is a matrix denoting coefficients of intercepts. is an matrix denoting SNP genotype for T tissues. This is similar to the 1 matrix, and we replace ones in the 1 matrix with SNP information. β is a matrix denoting the coefficients of SNP effects in T tissues. is the random effects due to the repeated measurements of individuals. D is an NT × NT matrix representing how individuals are shared across tissues. is the random noise.
Our goal is to combine information from multiple tissues in . However, as shown in the variance formula, the estimates in are dependent on each other. Thus, we can apply our decoupling strategy. Note that it is reasonable to assume that the genetic effects on expression level can vary between tissues. Therefore, application of the RE model is appropriate in this situation.
Funding
This work was supported by the Asan Institute for Life Sciences, Asan Medical Center, Seoul, Korea (2015-0222 to B.H.); the Korean Health Technology R&D Project, Ministry of Health & Welfare, Republic of Korea (HI14C1731 to B.H.); the National Science Foundation (0513612 to D.D. and E.E., 0731455 to D.D. and E.E., 0729049 to D.D. and E.E., 0916676 to D.D. and E.E., 1065276 to D.D. and E.E.); the National Institutes of Health (K25-HL080079 to D.D. and E.E., U01-DA024417 to D.D. and E.E., P01-HL30568 to D.D. and E.E., PO1-HL28481 to D.D. and E.E., NIH-NIAMS 1R01AR062886-01 to B.H., S.R. and P.I.W.d.B.) and the Netherlands Organization for Scientific Research (NWO) (VIDI award to P.I.W.d.B.).
Acknowledgements
We thank the WTCCC for data access.
Conflict of Interest statement: None declared.
References