gJLS2: an R package for generalized joint location and scale analysis in X-inclusive genome-wide association studies

Abstract A joint analysis of location and scale can be a powerful tool in genome-wide association studies to uncover previously overlooked markers that influence a quantitative trait through both mean and variance, as well as to prioritize candidates for gene–environment interactions. This approach has recently been generalized to handle related samples, dosage data, and the analytically challenging X-chromosome. We disseminate the latest advances in methodology through a user-friendly R software package with added functionalities to support genome-wide analysis on individual-level or summary-level data. The implemented R package can be called from PLINK or directly in a scripting environment, to enable a streamlined genome-wide analysis for biobank-scale data. Application results on individual-level and summary-level data highlight the advantage of the joint test to discover more genome-wide signals as compared to a location or scale test alone. We hope the availability of gJLS2 software package will encourage more scale and/or joint analyses in large-scale datasets, and promote the standardized reporting of their P-values to be shared with the scientific community.


Introduction
Genetic association studies examine the relationship between the genotypes of a single-nucleotide polymorphism (SNP; denoted by G) and a quantitative phenotype (denoted by Y), by testing the mean differences in Y according to the genotypes, otherwise known as a location test. More recently, several reports have investigated association with phenotypic variance of complex quantitative traits (Par e et al., 2010;Yang et al. 2012;Shungin et al., 2017), by testing the variance differences in Y across the genotype groups of a SNP, or known as a scale test, in hopes of finding biologically meaningful markers. One possible explanation of a significant scale effect is the presence of gene-gene (GÂG) or gene-environment (GÂE) interactions; both referred to as GÂE hereinafter. Unlike a direct test of GÂE interaction, a scale test can be used to indirectly infer GÂE without knowledge of the interacting covariates, thus alleviates the multiple hypothesis burden of testing all possible pairwise interactions, and the assumption that all interacting environmental variables could be (accurately) measured.
A more powerful approach to prioritize biologically relevant candidates is to jointly evaluate location and scale effects (Aschard et al., 2013;Cao et al. 2014;. As compared to other existing single-marker based joint tests, a joint location-scale (JLS) association test  is easy to implement on individual-level data or in a meta-analysis, requiring only the location and scale P-values for each SNP, marking the first generation of the JLS tool (https://github.com/dsoave/ JLS; accessed 2022 February 13). Since methods for genome-wide association studies (GWAS) of location are well-established, the main focus was on improving scale tests tailored for genetic data. Soave and Sun (2017) generalized the well-known Levene's test to handle complex data structures often observed in genetic studies, such as correlated samples and dosage data, leading to the next update, namely, generalized JLS (gJLS; https://github.com/ dsoave/gJLS; accessed 2022 February 13). Following the X-inclusive trend to genome-wide analyses, robust and powerful location (Chen et al. 2021) and scale tests (Deng et al. 2019) tailored for X-chromosome are now available.
In this study, we describe a generalized joint location and scale analysis tool (gJLS2) as an update to the JLS and gJLS methodology for autosomes, with added functionalities that (1) support X-chromosome mean and variance association analyses, (2) handle imputed data as genotypic probabilities or in dosage format, (3) allow the incorporation of summary statistics for location and/or scale tests, (4) implement a flexible framework that can accommodate additional covariates in both the location and scale association models, and (5) improve the computational time required for large-scale genetic data such as the UK biobank (Bycroft et al. 2018). We hope the availability of this unifying software package will encourage more X-inclusive, genome-wide, gJLS2 association analysis for complex continuous traits, particularly for those believed to be enriched for genetic interactions.

Materials and methods
The software can be installed in an R environment from CRAN: or github for the most recent version: via the "devtools" package (Wickham et al. 2018). An accompanying guide that documents each of the analytical options and scenarios is available at https://weiakanedeng.github.io/gJLS2/. Any feedbacks/bugs can be reported under github's issues tab (https://github.com/WeiAkaneDeng/gJLS2/issues).

Data preparation
The gJLS2 software package requires the minimal inputs of a quantitative trait and genotype data, which can be discrete genotype counts, continuous dosage genotype values or imputed genotype probabilities. The genotype data can be supplied in PLINK format via the R plug-in option using PLINK 1.9 (Chang et al. 2015) or any other format that can be read in R with packages "BGData" and "BEDMatrix" (Grueneberg and de los Campos 2019). The phenotype and covariate should be supplied in the same file, and if sex is required for X-chromosome analysis, then it needs to be the first column after individual IDs.
For smaller analyses or testing purpose, it is possible to use the package in R GUI directly. However, for genome-wide analyses on larger datasets, we recommend either Rscript commands or as an R-plugin within PLINK combined with a high-performing computing cluster environment. We will demonstrate all 3 approaches using the example datasets provided.

Example datasets
The package included 2 example datasets: one comprises of simulated phenotype, covariate data and real X-chromosome genotypes from the 1000 Genomes Project (The 1000 Genomes Project Consortium 2015), denoted by "chrXdat"; and another is based on summary statistics from the Genetic Investigation of ANthropometric Traits (GIANT) consortium for body mass index (BMI) and human standing height (Yang et al. 2012;Yengo et al. 2018).
The dataset "chrXdat" consists of 5 hand-picked SNPs rs5983012 (A/G), rs986810 (C/T), rs180495 (G/A), rs5911042 (T/C), and rs4119090 (G/A) that are outside of the pseudo-autosomal region, to cover observed minor allele frequency (MAF; calculated in females and rounded to the nearest digit) of 0.1, 0.2, 0.3, 0.4, and 0.45, respectively. See Supplementary Material Section 2 for more details on the simulated dataset. The summary statistics data comprised of association P-values of SNPs with the mean of BMI and height, under the column name "gL" for location, and those with the variance, under the column name "gS." The example data only included chromosome 16 SNPs to keep the size of example datasets manageable.

Location association
For autosomal SNPs, a linear regression fitted using the ordinary least square (OLS) is flexible to accommodate additional covariates and the default option. To account for related samples, the generalized least square (GLS) method is used by specifying a covariance structure for error terms in smaller samples. Users can either provide the covariance matrix or specify a structure for the covariance matrix according to predefined subgroups. However, for large population studies (n > 5,000), it is recommended to run the location association analysis using the state-of-art linear mixed models, such as LMM-BOLT (Loh et al. 2015) or SAIGE (Zhou et al. 2018) and supply the results as location summary statistics for the joint analysis.
The novel contribution of gJLS2 is the addition of our recommended X-chromosome location association strategy (Chen et al. 2021), which has good type I error control, is robust to sex confounding, arbitrary baseline allele choice, uncertainty of X-Chromosome Inactivation (XCI), and skewed XCI. Our chosen X-chromosome association model achieves these by simultaneously including the sex information, its interaction with both the additive genetic effect and dominance effect. The default location test P-value is obtained by testing the null hypothesis where G A is the additively coded genotype variable, G D is an indicator variable for the female heterozygotes, the sex variable S is coded with males as the baseline taking value 0 and females taking value 1, and C is a vector of covariates to be adjusted for. The regression coefficients b 0 ; c; b S are for the nongenetic covariates in the model and b G ; b GS ; b D denote the regression coefficients of interest, for the additive, GxSex, and dominance effects, respectively. The function "locReg" can be called with the X-chromosome option and returns the default 3-degree of freedom (df) test P-value:

Scale association
The scale component builds on 2 recent works that generalized to dosage genotype, related samples (Soave and Sun 2017), and the X-chromosome (Deng et al. 2019). Both are extensions of the generalized Levene's test via a regression framework. Besides a more flexible characterization of sample dependence structure, the generalization also allows analysis of imputed genotype data, which would otherwise be challenging with a group-based variance heterogeneity test, such as the Levene's test or the Brown-Forsythe test.
The variance test P-value is obtained using a 2-stage generalized Levene's test assuming additive variance effects. Briefly, the residuals d was computed under Equation (1) using the Least Absolute Deviation (LAD) algorithm, which gives the residuals in terms of each observation's distance with respect to the median (rather than the mean as in OLS) and is the default option. The absolute residuals were then fitted under the 3-df recommended model for discrete genotype: The following examples show results from LAD and OLS algorithms are very similar when the simulated phenotype is roughly symmetric: The imputed data are analyzed either by computing the dosage value and used in place of G additively (2 df) without the dominance term, or by replacing the genotype indicators for each observation, with the corresponding group probabilities (3 df). Similar to the location association, sample relatedness is dealt with using GLS for autosomal markers at the second stage of linear regression via the correlation matrix. Additional models, such as a sex-stratified variance test, can also be specified for X-chromosome. In this case, the scale test result is given by the Fisher's method that combines female and male-specific variance test results.
Note that a "Flagged" column is appended for these results, indicating the minimum genotype count in either females/males is less than 30 (indicated by a value of 1) or not (indicated by 0). This is based on the quality control that SNPs with a minimum count below 30 should be removed to avoid inflated type I errors Deng et al. 2019).

Joint location-and-scale analysis
The joint analyses can be performed automatically as part of the gJLS2 pipeline after running location and scale tests, where the default option applies a quantile transformation to the quantitative trait such that the location and scale test can be combined without inflating the type I error rates.
Alternatively, summary statistics, i.e. P-values from location and scale tests are allowed and can be combined via Fisher's method to give the corresponding test statistic for the joint analysis: where gL denotes the location P-value and gS the scale P-value. An important assumption underlying this simple method to combine evidence is the normality of the quantitative trait, which leads to gL and gS being independent under the null hypothesis. Though empirically the analyses remain valid for approximated normal distributions (Soave and Sun 2017), we strongly recommend the user to follow the default option and quantile transform the quantitative trait for location and scale association. We expect the independence between gL and gS to hold for X-chromosomal SNPs under appropriate location and scale models that account for the confounding effect of sex, XCI uncertainty and skewed XCI (Chen et al. 2021). The simulation study in Section 2 of the Supplementary Material was conducted to help readers gauge the effect of non-normality on the gJLS P-values for X-chromosome markers.  Scalability and using gJLS2 in non-GUI settings The main contribution of the software is the ability to handle Xchromosome (location and scale) analysis and it is not our aim to compete with existing autosomal association tools that are geared toward whole-genome computations. As a result, for both location and scale association, data splitting remain our primary strategy to deal with biobank scale data. We also implemented the summary statistics option to encourage the inclusion of genome-wide results (autosome) computed using existing software as inputs for the location portion. For X-chromosome associations, we recommend using gJLS2 in non-GUI settings and provide 2 practical options for biobank-scale data. For larger datasets, it is more convenient to run the joint analyses using the PLINK R plug-in following the a typical GWAS pipeline. The R plug-in relies on the Rserve package to establish communication between R and PLINK 1.9. The following script demonstrates the joint analyses for X-chromosome SNPs that included additional covariates.
Another option is to use the Rscript provided that allows additional arguments to change how frequently the results are written to the output (-write) and to increase the number of cores used (-nTasks). A core is an independent processing unit on a central processing unit (CPU). Though a modern computer, usually containing 4-8 cores, is capable of handling parallel computing, we recommend the "split-apply-combine" strategy and employing high-performance computing for large-scale analyses. Scripts for both options are available from github (https://github. com/WeiAkaneDeng/gJLS2/tree/main/inst/extdata).
To help assess the computational requirement at different data sizes, we sampled with replacement from the UKB X-chromosome data (restricted to 100 SNPs) to achieve sample sizes of: 1,000, 5,000, 10,000, 50,000, 100,000, 200,000, 300,000, 400,000, and repeated the joint-location-scale analysis using a single core with 10GB memory via (1) PLINK R plug-in and (2) Rscript. The reason for keeping the 100 SNPs to estimate the performance metric is because the analysis can be easily divided to chunks and combined after. Figure 1 shows the computational time and memory usage as a function of increasing sample size. These results suggest that the gJLS analysis on data with sample size up to 10,000 on less than 100 SNPs can be suitably handled in R GUI such as Rstudio, which might be the preferred option for confirmatory analyses.
For an X-chromosome wide analyses on UKB (n ¼ 488,377, m ¼ 15,179), the gJLS analysis took $16 h for PLINK (using 3.5GB memory) and $ 21 h (using 1.2GB) for Rscript. The memory efficiency of Rscript is expected as the "BEDmatrix" only maps the required portion of genotype files into memory. However, the Rscript can be parallelized, when using 4 cores with 20GB allocated memory, the wall clock run time was reduced to $13 h (using 14.0 GB).

Results and Discussion
The gJLS2 software supports the joint analysis on both individuallevel data as well as summary statistics. To highlight the functions in our package, we present (1)

Application to UK Biobank data
The sex-stratified means and variances for these quantitative traits are summarized graphically in Supplementary Figs. 1-4. A quick visual inspection suggests that quantile normalization should be applied, which is also the default options in gJLS2. We restricted analyses to white British samples (n ¼ 276,694), and for BMI, we further excluded those with diagnosed type 2 diabetes (n ¼ 262,837). We included only bi-allelic SNPs and filtered based on MAF < 0.01, HWE P-value <1E-5. A further check on sexstratified MAFs ( Supplementary Fig. 5) confirmed that the remaining 15,179 X-chromosomal SNPs are of good quality.
The genotype data can be supplied in PLINK binary format via the R plug-in option using PLINK 1.9 (Chang et al. 2015) or any other format that can be read in R with packages "BGData" and "BEDMatrix" (Grueneberg and de los Campos 2019). The phenotype and covariate should ideally be supplied in the same file, and if not, should have a common column to match samples. For genome-wide analyses on larger datasets, we recommend the use of a high-performing computing cluster and taking advantage of multiple cores whenever available.
Since the PLINK plug-in option only supports single-thread computation, we fixed the number of cores to be 1. There is no drastic difference between the 2 options, both took $20 h with 20G allocated memory (computing node specs 2xIntel E5-2683 v4 Broadwell @ 2.1 GHz), to complete the analysis for 276,694 unrelated European samples and 15,179 X-chromosome variants per trait. The gL, gS, and gJLS2 P-values are presented using a Manhattan plot, quantile-quantile plot, and a histogram  for each of the complex traits. We also tabulated a list of SNPs that did not pass the genome-wide significance threshold of 5E-8 for gL, but did pass for gJLS in Table 1, demonstrating the benefit of gJLS for additional genome-wide discoveries.

Application to summary statistics
The Rscript is more flexible than the PLINK plug-in solution as it can also handle analysis of summary statistics. The input file should contain at least 3 columns with headers "SNP," "gL," and "gS," while the output file has an additional column "gJLS." We re-formatted the subset of chromosome 16 summary statistics of location and scale obtained from the GIANT consortium for BMI and height as inputs.
The gL, gS, and gJLS2 P-values are presented using a Manhattan plot, quantile-quantile plot, and a histogram ( Supplementary Figs. 10 and 11). For chromosome 16, we gained 2 SNPs in DNAH3 gene that passed the genome-wide significance using gJLS for BMI and an additional 23 SNPs for height (Supplementary Table 1).

Concluding remarks
The gJLS2 software package is a versatile tool for genome-wide discovery that is X-chromosome inclusive. As compared to previous versions, namely, JLS and gJLS, it has improved remarkably in terms of methodology, flexibility, computational improvements and most importantly, usability for large-scale data. Meanwhile, we expect many new features to be added with ongoing work. Naturally, the analysis of rare variants is a possible future direction for scale association test under the regression framework of a generalized Levene's test. With the available summary statistics, a systematic approach to prioritize SNPs that takes into account the joint-location, scale effects, and functional annotation is needed to produce relevant candidates for gene-environment interactions. Finally, apart from the improved signal detection, the unbiased estimation of location effects for X-chromosome and scale effects in general remain open questions, but are expected to yield improved performance of polygenic prediction. Fig. 1. Computational usage of gJLS X-chromosome association analyses. The bars represent the memory in (Gigabyte) and wall clock time (seconds) used to perform a gJLS X-chromosome association for 100 markers at various sample sizes (x-axis, per a thousand samples).