GEM: scalable and flexible gene–environment interaction analysis in millions of samples

Abstract Motivation Gene–environment interaction (GEI) studies are a general framework that can be used to identify genetic variants that modify the effects of environmental, physiological, lifestyle or treatment effects on complex traits. Moreover, accounting for GEIs can enhance our understanding of the genetic architecture of complex diseases and traits. However, commonly used statistical software programs for GEI studies are either not applicable to testing certain types of GEI hypotheses or have not been optimized for use in large samples. Results Here, we develop a new software program, GEM (Gene–Environment interaction analysis in Millions of samples), which supports the inclusion of multiple GEI terms, adjustment for GEI covariates and robust inference, while allowing multi-threading to reduce computation time. GEM can conduct GEI tests as well as joint tests of genetic main and interaction effects for both continuous and binary phenotypes. Through simulations, we demonstrate that GEM scales to millions of samples while addressing limitations of existing software programs. We additionally conduct a gene-sex interaction analysis on waist-hip ratio in 352 768 unrelated individuals from the UK Biobank, identifying 24 novel loci in the joint test that have not previously been reported in combined or sex-specific analyses. Our results demonstrate that GEM can facilitate the next generation of large-scale GEI studies and help advance our understanding of the genetic architecture of complex diseases and traits. Availability and implementation GEM is freely available as an open source project at https://github.com/large-scale-gxe-methods/GEM. Supplementary information Supplementary data are available at Bioinformatics online.


Supplementary Methods
Consider a generalized linear model for N unrelated individuals where µ i = E(Y i |X i , G i ) is the conditional mean of the phenotype Y i for individual i given covariates X i (including an intercept for the model) and the genotype G i of a single genetic variant. The gene-environment interaction terms C i and S i are the products of G i and c and q environmental terms (which are included in the covariates X i ), respectively. The link function g(·) is a monotone function (usually the identity link function for continuous phenotypes, and the logit link function for binary phenotypes). Let Y = (Y 1 Y 2 . . . Y N ) T be a length N vector of the phenotypes, X = (X T 1 X T 2 . . . X T N ) T be an N × p matrix of p covariates (including an intercept for the model and all the c + q environmental terms that interact with the genotype), G = (G 1 G 2 . . . G N ) T be a length N vector of the genotype for this single genetic variant, C = (C T 1 C T 2 . . . C T N ) T be an N × c matrix of c gene-environment interaction terms that are adjusted for as covariates, S = (S T 1 S T 2 . . . S T N ) T be an N × q matrix of q gene-environment interaction terms of interest in the gene-environment interaction test (H 0 : β S = 0 versus H 1 : β S = 0), we fit a null model without any genetic effects g(µ i ) = X i β X . ( For a linear regression model, we can get the estimateβ X = (X T X) −1 X T Y and the length N residual vector r = Y − Xβ X , as well as the residual variance estimateσ 2 = r T r N −p . For a logistic regression model, we can iteratively solve the problem until convergence, with the estimateβ X = (X T W X) −1 X T WỸ , where W is a diagonal matrix with elementsμ i (1 −μ i ), withμ i = exp(Xiβ X ) 1+exp(Xiβ X ) being the estimate of P (Y i = 1|X i , G i ) for individual i, and also the length N residual vector r = Y −μ, whereμ = (μ 1μ2 . . .μ N ) T is a length N 1 vector of these estimated phenotype (disease) probabilities,Ỹ = Xβ X + W −1 r is the working vector at convergence.
For each genetic variant G in a linear regression model, we first computẽ G = G − X(X T X) −1 X T G,C = C − X(X T X) −1 X T C andS = S − X(X T X) −1 X T S, and estimate the marginal genetic effectβ G = (G TG ) −1G T r with a model-based variance estimate V ar(β G ) =σ 2 (G TG ) −1 or a robust (sandwich) variance estimate V ar R (β G ) = (G TG ) −1G T DG(G TG ) −1 , where D is a diagonal matrix with elements r 2 i (r i is the ith element of the residual vector r). Let U = (GCS) T r be a length (1 + c + q) vector (c ≥ 0), V = (GCS) T (GCS) and Ω = (GCS) T D(GCS) be (1 + c + q) × (1 + c + q) matrices, we then jointly estimate the genetic main effect and gene-environment interaction effectsβ G,C,S = (β Gβ T Cβ T S ) T = V −1 U , with a model-based covariance matrix estimate Cov(β G,C,S ) =σ 2 V −1 or a robust (sandwich) covariance matrix estimate Cov R (β G,C,S ) = V −1 ΩV −1 .
Similarly, for a logistic regression model, we first compute the quantities G = G − X(X T W X) −1 X T W G,C = C − X(X T W X) −1 X T W C and S = S −X(X T W X) −1 X T W S, and estimate the marginal genetic effectβ G = where D is a diagonal matrix with elements r 2 i (r i is the ith element of the residual vector r). Let U = (GCS) T r be a length (1+c+q) vector (c ≥ 0), V = (GCS) T W (GCS) and Ω = (GCS) T D(GCS) be (1 + c + q) × (1 + c + q) matrices, we then jointly estimate the genetic main effect and gene-environment interaction effectsβ G,C,S = (β Gβ T Cβ T S ) T = V −1 U , with a model-based covariance matrix estimate Cov(β G,C,S ) = V −1 or a robust (sandwich) covariance matrix estimate Cov R (β G,C,S ) = V −1 ΩV −1 .
As the sample size is very large, we compute asymptotic p-values for both linear and logistic regression models. Specifically, under the null hypothesis of no marginal genetic effects (H 0 : β G = 0), the marginal genetic effect test statistic β 2 G V ar(β G ) (or the robust versionβ 2 G V ar R (β G ) ) follows a χ 2 distribution with 1 degree of freedom. Let Cov(β S ) and Cov R (β S ) be q × q submatrices of Cov(β G,C,S ) and Cov R (β G,C,S ), respectively, corresponding to the q gene-environment interaction effect estimates of interestβ S , under the null hypothesis of no geneenvironment interactions of interest (H 0 : β S = 0), the interaction test statis-

Supplementary Figures
Supplementary Figure S1: Quantile-quantile plots from type I error simulations with no genetic effects. Plots compare observed versus expected p-value distributions for genetic variants on chromosomes 1-20, for which no genetic effects were simulated. Plots on the left correspond to the interaction test, and plots on the right correspond to the joint test. Plot titles denote the simulated distribution for the exposure, and panels (d)  Supplementary Figure S4: Benchmarking of GEM and alternative software programs for run time with a continuous or binary outcome and varying number of covariates using 100,000 simulated variants. Run time is shown as a function of sample size (x-axis). Numbers above each panel correspond to the number of non-exposure covariates (top) and outcome type (bottom). Colors correspond to software program and dashed lines correspond to runs that used the option to obtain robust standard errors. Circles and triangles correspond to programs compiled without or with Intel MKL, respectively. "GEM-opt" denotes GEM runs using optimal parameters for speed, including compilation with MKL and pgen file inputs. Data points are not shown for runs that exceeded 100GB of memory (ProbABEL, N>100k), or were unfinished after 7 days (SUGEN, N=5M, 30 covariates; QUICKTEST, N=5M, 30 covariates, robust). Figure S5: Benchmarking of GEM and alternative software programs for memory usage with a continuous or binary outcome and varying number of covariates using 100,000 simulated variants. Maximum memory footprint is shown as a function of sample size (x-axis). Numbers above each panel correspond to the number of non-exposure covariates (top) and outcome type (bottom). Colors correspond to software program and dashed lines correspond to runs that used the option to obtain robust standard errors. Circles and triangles correspond to programs compiled without or with Intel MKL, respectively. "GEM-opt" denotes GEM runs using optimal parameters for speed, including compilation with MKL and pgen file inputs. Data points are not shown for runs that exceeded 100GB of memory (ProbABEL, N>100k), were unfinished after 7 days (SUGEN, N=5M, 30 covariates; QUICKTEST, N=5M, 30 covariates, robust), or some that took less than 10 seconds to run (PLINK2, N=10k, continuous, 0 covariates; GEM-opt, N=10k, 0 covariates). Figure S6: Concordance of GEM and ProbABEL results. Regression coefficients were retrieved for a random 5000 variants from benchmarking runs using 3 covariates and a sample size of 10,000. Y-and x-axes display log p-values from GEM and ProbABEL, respectively. Panels correspond to the interaction test with continuous outcome (a), interaction test with binary outcome (b), joint test with continuous outcome (c), and joint test with binary outcome (d). Gray dashed line represents y = x.

Supplementary
Supplementary Figure S7: Quantile-quantile p-value plots from the UK Biobank waist-hip ratio analysis. Panels correspond to the gene-environment interaction test (a) and joint test (b). Y-axis and xaxis display negative logarithms of association p-values that were observed and expected (based on a uniform distribution), respectively. Lambda values correspond to genomic inflation statistics (median chisquared statistic / chi-squared statistic corresponding to p = 0.5).
Supplementary Figure S8: Manhattan plot displays association strengths for the joint test of genetic main effect and × interaction effect. x-axis represents genomic position and y-axis represents the negative logarithm of the p-value for association at that locus. Dashed line denotes the genome-wide significance threshold (p < 5×10 -8 ). Variants shown in orange passed a genome-wide significance threshold for both interaction and marginal effects. Variants shown in purple passed a genome-wide significance threshold for interaction effect, but not the marginal effect. Variants shown in green passed a genome-wide significance threshold using the joint test, but not for interaction nor marginal effects. For visualization purposes, variants with p < 1×10 -100 were excluded. Figure S9: Comparison between results from GEM and alternative software programs for top interaction test hits. (a) Negative logarithms of p-values are shown for QUICKTEST (y-axis) and GEM (x-axis). Because QUICKTEST cannot incorporate interaction covariates, it is compared against GEM p-values from both the primary model (including the BMI interaction covariate; left panel) and a model adjusting for BMI but not including an interaction covariate (equivalent to the QUICKTEST model; right panel). (b) As in (a), but comparing to SUGEN. (c) As in (a), but comparing to PLINK2. As PLINK2 can incorporate interaction covariates but not robust standard errors, the right panel displays results from a GEM run including the BMI interaction covariate and using model-based standard errors. Figure S10: Sex dimorphism of the effect of rs13389219 on waist-hip ratio (WHR). (a) Statistical summaries of WHR are shown as a function of sex (primary x-axis) and genotype at rs13389219 (fill colors). Hard-called genotypes were assigned if the corresponding genotype probability was at least 90%, and otherwise the sample was excluded for visualization. Boxplots represent medians (middle bar), first and third quartiles (hinges), and 1.5 interquartile ranges from the hinges (whiskers). Outliers outside of the boxplot whiskers are not shown for ease of visualization. (b) Genetic effect estimates (in raw WHR units) for rs13389219 from sex-stratified analyses (x-axis: females; y-axis: males) are compared. The model was equivalent to that used in the primary analysis other than the removal of interaction terms. Horizontal and vertical bars represent 95% confidence intervals for the female and male effect estimates, respectively, and the dashed line represents y = x. Figure S11: Influence of × interaction adjustment on × interaction tests. y-and x-axes display negative logarithms of p-values obtained from the primary model (with a × interaction adjustment) and a model without the × interaction adjustment, respectively. Variants included in the plot are those found to be sex dimorphic in Pulit et al. (at p < 5x10 -8 ) with no genomewide significant interaction in our analysis (shown in green) or vice versa (shown in orange). Figure S12: Genomic inflation after down-sampling. Genomic inflation lambda values were calculated for the marginal genetic effect terms for each of three random down-samplings of the UK Biobank dataset (see Methods). Quantile-quantile plots show observed versus expected -log10(p) for either marginal or interaction terms (as labeled). Gray and blue curves correspond to the marginal genetic effect and GEI effect for the full-sample analysis (N=352,768), while red, green, and yellow curves correspond to the marginal genetic effect for the three down-sampled datasets (N=87,695 each). Only p-values > 10 -30 are shown for the purposes of visualization. Supplementary Tables Supplementary Table S1