-
PDF
- Split View
-
Views
-
Cite
Cite
Xiuwen Zheng, J Wade Davis, SAIGEgds—an efficient statistical tool for large-scale PheWAS with mixed models, Bioinformatics, Volume 37, Issue 5, March 2021, Pages 728–730, https://doi.org/10.1093/bioinformatics/btaa731
- Share Icon Share
Abstract
Phenome-wide association studies (PheWASs) are known to be a powerful tool in discovery and replication of genetic association studies. To reduce the computational burden of PheWAS in the large cohorts, such as the UK Biobank, the SAIGE method has been proposed to control for case–control imbalance and sample relatedness in a tractable manner. However, SAIGE is still computationally intensive when deployed in analyzing the associations of thousands of ICD10-coded phenotypes with whole-genome imputed genotype data. Here, we present a new high-performance statistical R package (SAIGEgds) for large-scale PheWAS using generalized linear mixed models. The package implements the SAIGE method in optimized C++ codes, taking advantage of sparse genotype dosages and integrating the efficient genomic data structure file format. Benchmarks using the UK Biobank White British genotype data (N ≈ 430 K) with coronary heart disease and simulated cases show that the implementation in SAIGEgds is 5–6 times faster than the SAIGE R package. When used in conjunction with high-performance computing clusters, SAIGEgds provides an efficient analysis pipeline for biobank-scale PheWAS.
https://bioconductor.org/packages/SAIGEgds; vignettes included.
Supplementary data are available at Bioinformatics online.
1 Introduction
A phenome-wide association study (PheWAS) is known to be a powerful tool in discovery and replication of genetic association studies where it could be used for interrogating relationships between multiple phenotypes and targeted genetic markers (Denny et al., 2013). The advent of the UK Biobank (UKB) resource, a large prospective cohort study comprising ∼500 000 individuals from the UK with deeply phenotype data and about 90 million imputed genetic variants (Bycroft et al., 2018), presents tremendous opportunity for discovering genetic associations. The unprecedented sample size of genome-wide association studies (GWASs) in this cohort poses significant analytical and computational challenges. Mixed-model associations allow for adjusting family relatedness, while avoiding nearly 30% sample exclusions in linear regressions and resulting substantial loss of statistical power (Loh et al., 2018). However, the computational costs of mixed model in PheWAS are beyond the means of many research labs.
To reduce the computational burden of mixed models in terms of memory and CPU hours, BOLT-LMM was proposed recently, which accomplishes this in part by not storing a large pre-computed genetic relationship matrix (GRM) (Loh et al., 2015). To correct for case–control imbalance, the method scalable and accurate implementation of generalized mixed model (SAIGE) was further developed using the saddlepoint approximation (Zhou et al., 2018). SAIGE provides accurate P-values even for extremely unbalanced case–control comparisons, which are common in large population cohort studies, such as UKB. However, it is still computationally burdensome to analyze the associations of thousands of phenotypes with whole-genome imputed genotypes, especially for disease diagnoses using the ICD-10 codes in UKB.
Here, we describe a new accelerated implementation of SAIGE, named SAIGEgds, that is scalable for large-scale PheWAS using mixed models. SAIGEgds is a Bioconductor package that leverages the Genomic Data Structure (GDS) format to provide efficient mixed-model analyses entirely within the R environment (Zheng et al., 2017). GDS provides the same capabilities as Variant Call Format, and has been used in genetic association studies as well as a fast means of exploring sample relatedness and principal component analysis (Gogarten et al., 2012, 2019; Zheng et al., 2012, 2017). The recent application of GDS includes whole-genome sequencing associations in a cloud-based environment in the Trans-Omics for Precision Medicine project (Brody et al., 2017).
2 Features
To support efficient genetic data manipulation, SeqArray provides a framework to store and analyze integer genotypes, imputed numeric dosages and probabilities (Zheng et al., 2017). The original imputed genotypes of ∼90 million variants in UKB data are stored in BGEN format. We used the gds2bgen R package to convert BGEN to GDS files, so the dosages are directly stored in GDS files instead of genotype probabilities in BGEN. The LZ4 algorithm, known for extreme decompression speed, is used to compress dosages in GDS files. Total ∼2000 CPU hours were needed to convert 2.3 TiB UKB BGEN files to 1.8 TiB GDS files (Supplementary Fig. S1), but this is a one-time cost and an embarrassingly parallel calculation. For example, with a single multicore (20 core) CPU, all chromosomes can be converted in <5 days, while if the user has access to more nodes, this can be parallelized by chromosome and completed in <10 h, as no single chromosome took longer than 10 h to convert to GDS (Supplementary Fig. S1).
In a mixed-model GWAS, a model under the null hypothesis of no genetic association is fitted, and the P-value is calculated for each variant based on score test. To avoid a large pre-computed GRM, the SAIGE method uses in-memory SNPs to fit the null model via preconditioned conjugate gradient. SAIGEgds stores the in-memory SNPs in a more computationally efficient sparse form instead of a 2-bit dense matrix in SAIGE, at the cost of memory usage if using common variants.
For P-value calculations, the slower R code in SAIGE is replaced by a highly optimized C++ implementation in SAIGEgds with a sparse genotype vector. We also implement the saddlepoint approximation method in C. Multicore and job-level parallelism are supported via the package ‘parallel’ in R, where the variants can be chunked for efficient P-value calculations.
3 Performance
In the benchmarks, 430 K White British individuals and 4.7 million imputed variants on Chromosome 1 were used to compare the computation time of the SAIGEgds and SAIGE packages. R v3.6.0, SAIGE v0.29.4.4, SAIGEgds v1.0.0, gdsfmt v1.22.0 and SeqArray v1.26.0 were installed and all C/C++ codes were compiled with GCC v7.4.0.
The case–control phenotypes in the tests are simulated data with an independent random variable consisting of 1% cases, as well as real phenotype data from coronary artery disease (CAD, I25.1), which has a 4.6% prevalence in the UKB data. CAD is the most common type of cardiovascular disease and is the leading cause of death in the USA, affecting nearly half of American adults (Benjamin et al., 2019).
The benchmarks show that SAIGEgds is 5–6 times faster than the SAIGE R package in the steps of fitting null models and single variant P-value calculations (Fig. 1). SAIGEgds used more memory than the SAIGE package in Step 1 (18.6 G versus 11.8 G), but less memory in Step 2 (0.9 G versus 1.5 G). When multiple cores are used in Step 2, the memory usage scales linearly with the number of processes. Loading and decompressing dosages from GDS files occupies ∼20% of running time of SAIGEgds in Step 2.

Wall-clock time in the benchmarks of single variant test comparing SAIGEgds with SAIGE. (CPU: Intel® Xeon® E5-2640 v4 @2.40 GHz)
The P-values are consistent between these two packages, especially when it is <10−4. The discrepancy in P-values can be explained by different precisions of real numbers for dosages, and three storage strategies (1, 2 and 4 bytes for a number) are compared in Supplementary Figure S2. The P-value difference decreases when a higher numeric precision is used, at the cost of file size. To further compare P-values with a broader scope of prevalence, we randomly selected 20 ICD10-coded phenotypes with a wide range of frequencies from 0.01% to 13.3%. The small P-values are consistent when a byte is used to store a dosage in a GDS file, and the discrepancy is reduced when higher precisions are used (Supplementary Fig. S3).
In the PheWAS analyses of single variant testing, 430 K White British individuals and 977 ICD10-coded binary phenotypes were used (Supplementary Table S1). For each disease phenotype, the controls are the individuals without that disease. The covariates in the mixed model are sex and age at recruitment. The GRM was estimated from a LD-pruned autosomal set of directly genotyped SNPs. A high-performance cluster in the National Center for Supercomputing Applications at the University of Illinois was used to conduct the PheWAS analysis. The average CPU hour of SAIGEgds per phenotype was 118.2±13.1 (Supplementary Fig. S4), and ∼115 500 CPU hours in total were required for the 977 phenotypes, while the original SAIGE package needs an estimated 600 000 CPU hours. In Supplementary Figure S5, the association inflation factor increases with the number of cases in general, especially for >10 000 cases, but this effect is not specific to SAIGEgds as it is a general property of the SAIGE approach (data not shown). In the future version of SAIGEgds, we plan to implement multiple random factors in mixed models, and set-based aggregate tests with the burden and ACAT methods (Liu et al., 2019).
Acknowledgements
The authors thank Wei Zhou (University of Michigan) for supporting the SAIGEgds package (no funding was provided), and AbbVie employees Jason Grundstad, Mark Reppell, Sahar Esmaeeli and Fedik Rahimov for using the utilities.
Funding
The authors are employees of AbbVie and may hold stock in AbbVie. The design, study conduct and financial support for this research were provided by AbbVie. AbbVie participated in the interpretation of data, review and approval of the publication.
Conflict of Interest: The authors are affiliated with AbbVie.
References