Identifying small-effect genetic associations overlooked by the conventional fixed-effect model in a large-scale meta-analysis of coronary artery disease.

MOTIVATION
Common small-effect genetic variants that contribute to human complex traits and disease are typically identified using traditional fixed-effect meta-analysis methods. However, the power to detect genetic associations under fixed-effect models deteriorates with increasing heterogeneity, so that some small-effect heterogeneous loci might go undetected. Han and Eskin developed a modified random-effects meta-analysis approach (RE2) that is more powerful than traditional fixed and random-effects methods at detecting small-effect heterogeneous genetic associations; the method was subsequently updated (RE2C) to identify small-effect heterogeneous variants overlooked by traditional fixed-effect meta-analysis. Here we re-appraise a large-scale meta-analysis of coronary disease with RE2C to search for small-effect genetic signals potentially masked by heterogeneity in a fixed-effect meta-analysis.


RESULTS
Our application of RE2C suggests a high sensitivity but low specificity of this approach for discovering small-effect heterogeneous genetic associations. We recommend that reports of small-effect heterogeneous loci discovered with RE2C are accompanied by forest plots and SPRE (standardized predicted random-effects) statistics to reveal the distribution of genetic effect estimates across component studies of meta-analyses, highlighting overly influential outlier studies with the potential to inflate genetic signals.


AVAILABILITY
Scripts to calculate SPRE statistics and generate forest plots are available in the getspres R package from https://magosil86.github.io/getspres/.


SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics Online.

1 Introduction
Yang et al., 2011) even at heterogeneous loci. A modification of the traditional random-effects method, RE2, was designed to detect genetic associations both in the presence and absence of heterogeneity, providing an opportunity to identify small-effect heterogeneous variants that might go unnoticed in a FE meta-analysis (Han and Eskin, 2011). Most users of the RE2 random-effects method have employed it to refine associations at significant and suggestive genetic signals identified in FE meta-analyses (Sapkota et al., 2015, 2017; Wyss et al., 2018). In its latest iteration, RE2C (Lee et al., 2017), the method reports the subset of variants detected by RE2 for which $P_{RE2} \le P_{FE}$. The RE2C update is intended to have a broad application, extending beyond the augmentation of summary association P-values of variants identified in FE meta-analysis to the discovery of additional and potentially novel loci. The power advantage of RE2, and by extension RE2C, over traditional fixed and random-effects meta-analysis models is partly attributable to a relaxed null hypothesis, which assumes homogeneity of genetic effects under the null and thereby provides a greater contrast between the null and alternative hypotheses ($H_0: \mu = 0$ and $\tau^2 = 0$ versus $H_1: \mu \neq 0$ or $\tau^2 > 0$, asymptotically).
Heterogeneity of genetic effects might arise from biologically relevant differences among contributing studies in a meta-analysis, such as diverse ancestries, linkage disequilibrium patterns, subphenotypes, ages of disease onset, family history of disease or gender. Alternatively, differences in the direction and/or size of genetic effect estimates among participating studies in a meta-analysis could reflect genotyping error or population structure (i.e. local admixture), where, for example, the average genetic effect estimate at a variant of interest is inflated by a few outlier studies showing outsized effects while the majority of study effects are marginal. Heterogeneity at individual variants can be explored through forest plots and the calculation of standardized predicted random-effects (SPREs), while heterogeneity patterns across multiple variants can be conveniently inspected through the calculation of M statistics (Magosi et al., 2017). Notably, SPREs are precision-weighted residuals that indicate the direction and extent to which individual studies in a meta-analysis deviate from the average genetic effect (Harbord and Higgins, 2008; Magosi et al., 2017), and can be a useful quantitative indicator of whether the average genetic effect at a variant of interest might be unduly influenced by outlier studies showing extreme effects.
In this report, we revisit the CARDIoGRAMplusC4D meta-analysis (60 801 cases and 123 504 controls) of coronary artery disease (CAD) with the RE2C random-effects method, to search for additional CAD loci potentially masked by heterogeneity in the primary FE meta-analysis.

CARDIoGRAMplusC4D
Summary data (i.e. logistic regression coefficients and their corresponding standard errors) were collated from 48 genome-wide association studies of coronary disease risk that comprised individuals from 6 different ancestry groups including: African (n = 1) and Hispanic American (n = 1), East (China and Korea, n = 3) and South (India and Pakistan, n = 4) Asian, Middle Eastern (Lebanese, n = 1) and European (n = 38); meta-analysis was conducted for a set of ~9 million variants with minor allele frequencies >0.005 (CARDIoGRAMplusC4D Consortium, 2015). Design details of each participating CARDIoGRAMplusC4D study are summarized in Supplementary Table S1; the coronary disease phenotype included patients with an inclusive CAD diagnosis (e.g. myocardial infarction, acute coronary syndrome, chronic stable angina or coronary stenosis >50%). Study-level genomic correction (Devlin and Roeder, 1999) was applied to each study to minimize false positives induced by inflated association test statistics. Variant effect-size estimates ($\beta$ coefficients scaled as $\log_e$(odds ratios) from an additive-effects-only association model) in each study were aligned such that the same risk allele was compared across the studies assembled in the meta-analysis. The studies contributing to the CARDIoGRAMplusC4D study obtained ethical approval from the ethics committees of the respective medical faculties, and informed consent was obtained from all participants. Summary genetic association data were anonymously meta-analysed and are reported here. Membership of the CARDIoGRAMplusC4D Consortium is provided in the Supplementary Text S1. Requests for access to the summary statistics are coordinated by the CARDIoGRAMplusC4D Steering Committee (www.cardiogramplusc4d.org).

UK Biobank
The UK Biobank study (UKBB) is a large-scale prospective study of over half a million participants commissioned to assemble comprehensive data on genotypic, socio-demographic, lifestyle and environmental factors with the aim of developing better strategies for the prevention, diagnosis and treatment of common diseases (Sudlow et al., 2015) such as cardiovascular disease (Littlejohns et al., 2019). Data from an interim release of GWAS genotypes for 296 525 participants were previously merged and analysed with clinical phenotype data that identified 34 541 cases of coronary heart disease and 261 984 controls from England, Scotland and Wales aged 45-69 years (van der Harst and Verweij, 2018). Coronary disease case status was assigned to prevalent and incident cases of myocardial infarction, acute coronary syndromes and associated therapeutic interventions (e.g. revascularization). Association summary statistics ($\beta$ coefficients scaled as $\log_e$(odds ratios) and associated standard errors from an additive-effects-only logistic regression association model) from this analysis were downloaded from the www.cardiomics.net server. Design details of the UK Biobank participants to compare with the CARDIoGRAMplusC4D cohorts are included in Supplementary Table S1.

RE2 and RE2C meta-analysis
Genetic association meta-analyses are typically performed under a random-effects model (RE) when the objective is both to estimate a summary effect (i.e. average genetic effect) across studies in a meta-analysis and to measure the amount of heterogeneity. Consider a meta-analysis comprising $S$ studies $(s = 1, 2, 3, \ldots, S)$ where the genetic effect-size estimate and corresponding standard error of a variant of interest were obtained via regression modelling in each study, and the average genetic effect estimate, $\hat{\mu}$, is calculated as the inverse-variance weighted mean of the individual study effects:

$$\hat{\mu} = \frac{\sum_{s=1}^{S} w_s y_s}{\sum_{s=1}^{S} w_s}$$

where $y_s$ represents the study effect-size estimate in the $s$th study, and $w_s$ denotes the weight assigned to the $s$th study, which can be calculated as $w_s = 1 / (\sigma_s^2 + \hat{\tau}^2)$. Notably, $\sigma_s^2$ and $\hat{\tau}^2$ represent sampling variance and heterogeneity, respectively.
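
The weighted mean described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the RE2C software or the getspres package: the function name `weighted_mean_effect` and the example numbers are hypothetical, and the heterogeneity estimate `tau2` is supplied by the caller because the text does not state which estimator is used.

```python
from typing import List, Sequence, Tuple


def weighted_mean_effect(
    effects: Sequence[float],
    std_errors: Sequence[float],
    tau2: float = 0.0,
) -> Tuple[float, List[float]]:
    """Return (mu_hat, weights) using w_s = 1 / (sigma_s^2 + tau2)."""
    if len(effects) == 0 or len(effects) != len(std_errors):
        raise ValueError("effects and std_errors must be non-empty and equal in length")
    if tau2 < 0:
        raise ValueError("tau2 must be non-negative")
    # Each study is weighted by the inverse of its total variance.
    weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    mu_hat = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return mu_hat, weights


# Hypothetical log odds ratios and standard errors from three studies.
mu_hat, weights = weighted_mean_effect([0.10, 0.05, 0.30], [0.04, 0.05, 0.10], tau2=0.002)
print(f"average genetic effect: {mu_hat:.4f}")
```

Setting `tau2=0.0` reduces the weights to $1/\sigma_s^2$, which is the fixed-effect weighting.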

Traditional RE
The traditional RE tests the null hypothesis that the average genetic effect, $\mu$, is zero, that is, $H_0: \mu = 0$ versus $H_1: \mu \neq 0$, and its summary association test statistic under the null is given by $Z^2_{RE} = \left(\hat{\mu} / SE(\hat{\mu})\right)^2 \sim \chi^2_1$ (asymptotically) (Neupane et al., 2012).
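
As a rough illustration of this test, the sketch below computes $Z^2_{RE}$ and its $\chi^2_1$ P-value in plain Python. It assumes the usual standard error of an inverse-variance weighted mean, $SE(\hat{\mu}) = (\sum_s w_s)^{-1/2}$, which the text does not spell out, and it takes `tau2` as an input rather than estimating it. The function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def traditional_re_test(
    effects: Sequence[float],
    std_errors: Sequence[float],
    tau2: float,
) -> Tuple[float, float, float, float]:
    """Return (mu_hat, se_mu, z2, p_value) for H0: mu = 0."""
    weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    total_w = sum(weights)
    mu_hat = sum(w * y for w, y in zip(weights, effects)) / total_w
    # Assumed standard error of the weighted mean: sqrt(1 / sum of weights).
    se_mu = math.sqrt(1.0 / total_w)
    z2 = (mu_hat / se_mu) ** 2
    # Upper tail of a chi-square with 1 df: P(X > z2) = erfc(sqrt(z2 / 2)).
    p_value = math.erfc(math.sqrt(z2 / 2.0))
    return mu_hat, se_mu, z2, p_value


# Hypothetical two-study example with no heterogeneity.
mu_hat, se_mu, z2, p_value = traditional_re_test([0.2, 0.2], [0.1, 0.1], tau2=0.0)
print(f"Z^2 = {z2:.3f}, P = {p_value:.4g}")
```

Because a larger `tau2` shrinks every weight, it widens $SE(\hat{\mu})$ and lowers $Z^2_{RE}$, which is the loss of power under heterogeneity that motivates RE2.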

Contemporary random-effects model (RE2)
In contrast to the traditional RE, which assumes the presence of heterogeneity under the null, the RE2 model tests the null hypothesis that the average genetic effect is zero and there is no heterogeneity; that is, $H_0: \mu = 0$ and $\tau^2 = 0$ versus $H_1: \mu \neq 0$ or $\tau^2 > 0$ (asymptotically) (Han and Eskin, 2011; Neupane et al., 2012). The summary association test statistic (or likelihood ratio test statistic) for the RE2 model, $S_{RE2}$, approximates a 50:50 mixture of $\chi^2_1$ and $\chi^2_2$ under the null asymptotically in meta-analyses with larger numbers of studies. For meta-analyses with fewer studies (2-50), Han and Eskin provide tabulated RE2 P-values corrected for small sample-size based on the assumption that the studies are equally weighted (i.e. same sample-size). The asymptotic RE2 summary association P-value follows from this mixture distribution; after a correction for small samples, the RE2 summary association P-value is obtained by adjusting the asymptotic P-value with $k(N, S_{RE2})$, the small-sample correction factor (Lee et al., 2017).
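
To make the joint null concrete, the following sketch computes a likelihood ratio statistic for $H_0: \mu = 0, \tau^2 = 0$ and the asymptotic mixture P-value. Several things here are assumptions rather than details taken from the text: study estimates are modelled as normal with variance $\sigma_s^2 + \tau^2$, the likelihood is maximised over $\tau^2$ with a simple golden-section search, and all function names are hypothetical. The small-sample correction factor is not applied, and this is not the RE2 software.

```python
import math
from typing import List, Sequence, Tuple


def _loglik(mu: float, tau2: float, x: Sequence[float], v: Sequence[float]) -> float:
    """Normal log-likelihood with study variances v_s + tau2."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * (vi + tau2)) + (xi - mu) ** 2 / (vi + tau2)
        for xi, vi in zip(x, v)
    )


def _profile_loglik(tau2: float, x: Sequence[float], v: Sequence[float]) -> float:
    """Log-likelihood at the best mu for a fixed tau2 (the weighted mean)."""
    w = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    return _loglik(mu, tau2, x, v)


def re2_lrt_statistic(
    effects: Sequence[float], std_errors: Sequence[float], iters: int = 200
) -> Tuple[float, float]:
    """Return (statistic, tau2_hat) for H0: mu = 0 and tau2 = 0."""
    x: List[float] = list(effects)
    v = [se ** 2 for se in std_errors]
    # The squared range of the effects bounds the maximum-likelihood tau2.
    lo, hi = 0.0, (max(x) - min(x)) ** 2
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):  # golden-section search for the maximum
        c = hi - g * (hi - lo)
        d = lo + g * (hi - lo)
        if _profile_loglik(c, x, v) < _profile_loglik(d, x, v):
            lo = c
        else:
            hi = d
    tau2 = 0.5 * (lo + hi)
    if _profile_loglik(0.0, x, v) >= _profile_loglik(tau2, x, v):
        tau2 = 0.0  # the boundary value fits at least as well
    stat = 2.0 * (_profile_loglik(tau2, x, v) - _loglik(0.0, 0.0, x, v))
    return max(stat, 0.0), tau2


def re2_asymptotic_pvalue(stat: float) -> float:
    """Upper tail of a 50:50 mixture of chi-square(1) and chi-square(2)."""
    p_chi2_1 = math.erfc(math.sqrt(stat / 2.0))
    p_chi2_2 = math.exp(-stat / 2.0)
    return 0.5 * p_chi2_1 + 0.5 * p_chi2_2


stat, tau2_hat = re2_lrt_statistic([0.5, -0.5], [0.1, 0.1])
print(f"S = {stat:.3f}, tau2 = {tau2_hat:.3f}, P = {re2_asymptotic_pvalue(stat):.3g}")
```

In this hypothetical example the two studies have opposite effects that average to zero, yet the statistic is large. That illustrates how a significant result under this null can reflect heterogeneity alone.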

Updated RE2 model (RE2C)
The RE2C approach is an adaptation of the RE2 model designed to: (i) facilitate discovery of small-effect heterogeneous variants and (ii) minimize redundancy between genetic variants identified by the FE and RE2 models, as it is commonplace to perform an FE analysis prior to a random-effects analysis when conducting genetic association meta-analyses. To reduce redundancies between RE2 and FE analyses, the RE2C approach partitions summary association P-values produced by the RE2 model into two groups, assigning variants with RE2 P-value $\le$ FE P-value the RE2 summary association statistic, $S_{RE2}$, and zero otherwise (Lee et al., 2017):

$$S_{RE2C} = \begin{cases} S_{RE2} & \text{if } P_{RE2} \le P_{FE} \\ 0 & \text{otherwise} \end{cases}$$

In contrast to the RE2 summary association statistic, the RE2C statistic, $S_{RE2C}$, does not approximate a 'well-known' asymptotic distribution. To calculate RE2C P-values, the RE2 summary association statistic is decomposed into two component statistics. The first, $S_{FE}$, is equal to the square of the FE summary association statistic, $Z^2_{FE}$, and asymptotically approximates $\chi^2_1$ under the null. The second, $S_{Het}$, tests for the presence of heterogeneity akin to the Q-test of heterogeneity and asymptotically approximates a 50:50 mixture of 0 and $\chi^2_1$ when the number of studies in a meta-analysis is large; for smaller meta-analyses, Lee et al. (2017) provide tabulated empirical distributions of $S_{Het}$. For each $S_{FE}$, the RE2C approach searches for $S_{Het}$ such that $P_{RE2} \le P_{FE}$, and the resulting lower boundary of $S_{Het}$ is referred to as $S_{Het.low}(S_{FE}, N)$, where $N$ is the number of studies. Then, for an observed RE2C statistic, $\hat{S}_{RE2C}$, the range of $S_{FE}$ is divided into $K$ small bins $(x_i,\ i = 1, 2, 3, \ldots, K)$ (e.g. 1000 bins in [0, 50]) and the RE2C summary association P-value is approximated by a sum over these bins, each of width $\Delta x$, such that $P_{RE2C} < P_{RE2}$ while $P_{RE2} \le P_{FE}$ (Lee et al., 2017).
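
The partition rule and the two-component decomposition can be written down directly. The sketch below is illustrative only: the function names are hypothetical, the comparison is taken to be non-strict ($P_{RE2} \le P_{FE}$), which is an assumption about the original notation, and the binned approximation of the RE2C P-value is not implemented because it relies on the tabulated distributions of Lee et al. (2017).

```python
def re2c_statistic(s_re2: float, p_re2: float, p_fe: float) -> float:
    """Keep the RE2 statistic only where RE2 is at least as significant as FE."""
    # Assumption: non-strict comparison, P_RE2 <= P_FE.
    return s_re2 if p_re2 <= p_fe else 0.0


def heterogeneity_component(s_re2: float, z_fe: float) -> float:
    """Return S_Het, the part of S_RE2 not explained by S_FE = Z_FE^2."""
    s_fe = z_fe ** 2
    # Guard against tiny negative values caused by rounding.
    return max(0.0, s_re2 - s_fe)


# Hypothetical variant: weak fixed-effect signal, strong heterogeneity.
s_re2, z_fe = 41.6, 0.5
print(re2c_statistic(s_re2, p_re2=1e-9, p_fe=0.62))
print(heterogeneity_component(s_re2, z_fe))
```

A variant that the FE model already detects more strongly than RE2 receives a statistic of zero, which is how RE2C avoids re-reporting loci found in the primary FE analysis.
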

2.3 Evaluation of heterogeneity for individual variants and M statistics

2.3.1 Calculation of SPRE statistics
Standardized predicted random-effect statistics are precision-weighted residuals that capture the direction and extent to which individual genetic effects of studies in a meta-analysis deviate from the average genetic effect at a variant of interest. Consider a genetic association meta-analysis (P), comprising $S$ GWAS $(s = 1, 2, 3, \ldots, S)$ and $V$ independently associated lead variants $(v = 1, 2, 3, \ldots, V)$. At each lead variant, study effect-size estimates (and the corresponding standard errors) are analysed with an RE model to estimate the average genetic effect and separate the variability observed among study effects into random sampling variation and between-study heterogeneity. A SPRE is then computed for each study at each lead variant, with $\mathrm{SPRE}_{s,v}$ denoting the SPRE for the $v$th lead variant in the $s$th study. This yields an array of $S \times V$ SPREs that can be exploited to reveal systematic genetic differences among studies in the meta-analysis. Specifically, SPREs can be aggregated by study to expose outlier studies showing either consistently stronger or weaker than average genetic effects.
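
For intuition, the sketch below computes one common form of standardized residual under a random-effects model: the deviation of each study from the weighted mean, divided by the standard deviation of that deviation. Whether getspres uses exactly this variance term is an assumption; the function name is hypothetical and `tau2` is supplied by the caller.

```python
import math
from typing import List, Sequence


def standardized_re_residuals(
    effects: Sequence[float],
    std_errors: Sequence[float],
    tau2: float,
) -> List[float]:
    """Standardized deviations of study effects from the random-effects mean."""
    if len(effects) < 2 or len(effects) != len(std_errors):
        raise ValueError("need at least two studies with matching inputs")
    weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    total_w = sum(weights)
    mu_hat = sum(w * y for w, y in zip(weights, effects)) / total_w
    residuals = []
    for y, se in zip(effects, std_errors):
        # Variance of (y_s - mu_hat): total study variance minus Var(mu_hat).
        resid_var = se ** 2 + tau2 - 1.0 / total_w
        residuals.append((y - mu_hat) / math.sqrt(resid_var))
    return residuals


# Hypothetical variant where the last study shows an outsized effect.
spres = standardized_re_residuals(
    [0.02, 0.01, 0.03, 0.02, 0.60], [0.05, 0.05, 0.05, 0.05, 0.10], tau2=0.0
)
print([round(r, 2) for r in spres])
```

Positive values indicate studies with stronger than average effects and negative values weaker than average effects, so the sign carries the direction of deviation.
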

Calculation of M statistics-aggregation of SPREs
SPRE statistics can be aggregated in a variety of ways; a simple approach that both identifies systematic outliers and reveals their direction of effect is to calculate the 'mean' aggregate heterogeneity statistic, M. M statistics are computed by calculating the arithmetic mean of SPREs within each study in a meta-analysis so that each study has a single M statistic value, and the M statistic value for the $s$th study is represented by:

$$M_s = \frac{1}{V} \sum_{v=1}^{V} \mathrm{SPRE}_{s,v}$$

Assuming the SPREs of lead variants in each study are mutually independent standard normal random variables, that is $\mathrm{SPRE} \sim N(0, 1)$, with mean $E(\mathrm{SPRE}) = 0$ and variance $\mathrm{Var}(\mathrm{SPRE}) = 1$, then $M$ is normally distributed, $M_s \sim N(0, 1/V)$.
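
The aggregation step is a per-study mean, sketched here in Python. The function name and study labels are hypothetical. The second returned value scales $M_s$ by $\sqrt{V}$, which gives a standard normal score only under the independence assumption stated above.

```python
import math
from typing import Dict, Sequence, Tuple


def m_statistics(spres_by_study: Dict[str, Sequence[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each study to (M, z), where M is the mean SPRE across lead variants."""
    results = {}
    for study, spres in spres_by_study.items():
        n_variants = len(spres)
        if n_variants == 0:
            raise ValueError(f"study {study!r} has no SPRE values")
        m_value = sum(spres) / n_variants
        # Under independent N(0, 1) SPREs, M has variance 1 / V.
        z_score = m_value * math.sqrt(n_variants)
        results[study] = (m_value, z_score)
    return results


# Hypothetical SPREs for two studies at three lead variants.
for study, (m_value, z_score) in m_statistics(
    {"study_A": [1.0, -1.0, 0.0], "study_B": [3.0, 3.0, 3.0]}
).items():
    print(study, round(m_value, 3), round(z_score, 3))
```

A study whose effects are consistently stronger than average across variants, like `study_B` here, ends up with a large positive M even if no single SPRE is extreme on its own.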

Q-statistic and heterogeneity index
Heterogeneity was also assessed using the Q-statistic (Cochran, 1954) and the heterogeneity index ($I^2$) measure (Higgins and Thompson, 2002); $I^2$ was further used to quantify heterogeneity in M statistics.
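
Both measures have standard closed forms, sketched below: Cochran's Q is the weighted sum of squared deviations from the fixed-effect mean, and $I^2 = \max(0, (Q - df)/Q)$ with $df = S - 1$. The function name is hypothetical, and $I^2$ is returned as a proportion rather than a percentage.

```python
from typing import Sequence, Tuple


def cochran_q_and_i2(
    effects: Sequence[float], std_errors: Sequence[float]
) -> Tuple[float, int, float]:
    """Return (Q, degrees of freedom, I^2 as a proportion)."""
    if len(effects) < 2 or len(effects) != len(std_errors):
        raise ValueError("need at least two studies with matching inputs")
    # Fixed-effect weights: inverse of the sampling variance only.
    weights = [1.0 / se ** 2 for se in std_errors]
    mu_fe = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q_stat = sum(w * (y - mu_fe) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2 is truncated at zero when Q falls below its expectation.
    i2 = max(0.0, (q_stat - df) / q_stat) if q_stat > 0 else 0.0
    return q_stat, df, i2


q_stat, df, i2 = cochran_q_and_i2([0.1, 0.3], [0.1, 0.1])
print(f"Q = {q_stat:.3f} on {df} df, I^2 = {100 * i2:.1f}%")
```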

Single-variant heterogeneity analysis of 382 novel RE2C loci
Most (85.6%) of the lead variants showed marked heterogeneity (Q-statistic $P < 1 \times 10^{-7}$), with at least half of the lead variants showing relatively high levels of heterogeneity ($I^2 > 72.1\%$) (Supplementary Table S2). Next, we calculated SPREs and generated forest plots to inspect heterogeneity patterns at lead variants of the 382 RE2C loci. Most (90%) of the RE2C lead variants had one or more outlier studies where genetic effect-size estimates deviated substantially ($|\mathrm{SPRE}| > 3\sigma$) from the average genetic effect. This empirical threshold to flag overly influential outliers ($|\mathrm{SPRE}| > 3\sigma$) was informed by rs2891168 (chromosome 9p21) in the primary CARDIoGRAMplusC4D meta-analysis, where this well-established locus had $\max |\mathrm{SPRE}| = 2.87$ (CARDIoGRAMplusC4D Consortium, 2015). An inspection of forest plots for the 382 RE2C lead variants revealed heterogeneity patterns that were grouped into three categories (Supplementary Fig. S1). Most (n = 323) of the lead variants fell in the first category, where at least one study showed outsized effects while the majority of the studies showed minimal effects (Supplementary Fig. S2 and Table S3). Lead variants (n = 28) in the second category generally showed heterogeneity patterns with outlier studies showing contrasting effects; in particular, the forest plots showed both positive outlier studies ($\mathrm{SPRE} > +3\sigma$) with the potential to inflate the average genetic effect as well as negative outliers ($\mathrm{SPRE} < -3\sigma$) that might lower or change the direction of the mean genetic effect, a scenario where dropping either type of outlier would likely induce a false positive or negative signal (Supplementary Fig. S3 and Table S3). The final category comprised 31 lead variants where there was little evidence of overly influential outlier studies, consistent with heterogeneity patterns plausibly induced by biologically relevant differences (Supplementary Fig. S4 and Table S3).
A general trend that emerged from inspecting heterogeneity patterns at the individual RE2C lead variants was that RE2C P-values became more extreme (i.e. smaller) with increasing levels of heterogeneity (Supplementary Table S2).

Replication in the UK Biobank study
We next explored whether genetic associations between lead variants at the novel RE2C loci and CAD risk could be replicated in a large-scale prospective study based on 296 525 participants (including 34 541 cases of coronary heart disease) from England, Scotland and Wales aged 45-69 years (van der Harst and Verweij, 2018).
Only 24 of the 323 RE2C lead variants available in the UK Biobank GWAS were replicated ($P_{UKBB} < 5 \times 10^{-5}$, Supplementary Table S5). All but 3 of the replicated genetic signals had traditional FE meta-analysis P-values that were significant at genome-wide levels ($P_{FE} < 5 \times 10^{-8}$) and just 2 of the 24 showed marked heterogeneity ($I^2 > 0.5$) (Supplementary Table S5). Furthermore, 3 replicated variants included an influential outlier study in the CARDIoGRAMplusC4D meta-analysis; these 3 variants were also GWAS-significant in the FE meta-analysis (Supplementary Tables S2 and S5). These findings are consistent with Han and Eskin's (2011) observation that the power of RE2 only exceeded FE meta-analysis for markedly heterogeneous variants.
Finally, a meta-regression model of M statistics for 323 RE2C lead variants in a combined CARDIoGRAMplusC4D and UK Biobank meta-analysis confirmed genomic control inflation as a potential source of systematic heterogeneity in genetic meta-analyses (Supplementary Table S6 and Fig. S6).

Discussion
Our application of the RE2C method to the CARDIoGRAMplusC4D meta-analysis dataset highlights the high sensitivity but low specificity of the method as a discovery tool for small-effect heterogeneous genetic associations. Consequently, the practical advantage afforded by the improved power of the RE2C method will likely be in augmenting P-values for putative loci highlighted by traditional fixed and random-effects meta-analyses.
Altogether, the majority (n = 331) of lead variants discovered in the CARDIoGRAMplusC4D meta-analysis by the RE2C random-effects method fell outside the scope of tentatively associated CAD risk variants ($P_{FE} > 5 \times 10^{-5}$) (Supplementary Table S2). Significant P-values under the RE2 and RE2C models can represent a non-null average genetic effect and/or considerable heterogeneity ($H_0: \mu = 0$ and $\tau^2 = 0$ versus $H_1: \mu \neq 0$ or $\tau^2 > 0$, asymptotically) (Neupane et al., 2012). Therefore, the genome-wide significant RE2C P-values at the 277 lead variants where genetic associations with CAD were irreproducible in the UKBB dataset ($P_{UKBB} > 5 \times 10^{-5}$) and where $P_{FE} > 5 \times 10^{-5}$ likely signify substantial heterogeneity of genetic effects at the individual variants rather than novel CAD signals.
Small-effect genetic associations at variants with relatively high heterogeneity might elicit skepticism regarding the potential reproducibility of such associations. However, there are notable exceptions within the coronary disease landscape, such as rs2891168, the lead variant for the chromosome 9p21 CAD risk locus in the CARDIoGRAMplusC4D data (2015), which shows substantial heterogeneity (Q-statistic $P < 4.2 \times 10^{-7}$; $I^2 = 58\%$) but with no exceptional outlier studies (i.e. $|\mathrm{SPRE}| < 2.6\sigma$), a heterogeneity pattern typified in Supplementary Figure S4. rs2891168 tags one of the strongest associated loci in CARDIoGRAMplusC4D (odds ratio = 1.2, $P < 2 \times 10^{-98}$), a meta-analysis dataset heavily weighted by European (69%), South Asian (20%) and East Asian (7%) data (Supplementary Table S1). Other tagging variants for this locus in strong linkage disequilibrium have been convincingly validated to show comparable strength associations with CAD risk in some non-European populations (e.g. India and Pakistan, Coronary Artery Disease (C4D) Genetics Consortium, 2011; Han Chinese, Lu et al., 2012; multi-ethnic cohorts from East Asia, Han et al., 2017) but not, for instance and to our knowledge, in populations of African ancestry. The latter are poorly represented in CARDIoGRAMplusC4D (African Americans form ~1% of the total data), limiting opportunities to judge the informativity or otherwise of individual loci in this meta-analysis dataset. Based on our experience of applying RE2C to the CARDIoGRAMplusC4D dataset, we recommend as best practice that reports of small-effect heterogeneous loci discovered with this method be accompanied by forest plots and SPRE statistics to explore the distribution of genetic effect estimates across participating studies. This can highlight overly influential outlier studies with the potential to inflate genetic signals, prompting researchers to reflect upon the underlying data that gave rise to novel heterogeneous associations.