DeepPheWAS: an R package for phenotype generation and association analysis for phenome-wide association studies

Abstract
Summary: DeepPheWAS is an R package for phenome-wide association studies that creates clinically curated composite phenotypes and integrates quantitative phenotypes from primary care data, longitudinal trajectories of quantitative measures, disease progression and drug response phenotypes. Tools are provided for efficient analysis of association with any genetic input, under any genetic model, with optional sex-stratified analysis, and for developing novel phenotypes.
Availability and implementation: The DeepPheWAS R package is freely available under the GNU General Public License v3.0 at https://github.com/Richard-Packer/DeepPheWAS.
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Phenome-wide association studies (PheWASs) can be used to better understand the pleiotropic effects of genetic variants (Tyler et al., 2016) and to inform drug development through target identification, target validation and the use of variants that mimic drug effects to assess likely drug efficacy, safety and drug repurposing opportunities (Diogo et al., 2018; Gill et al., 2019; Khosravi et al., 2019). A PheWAS comprises two stages: phenotype generation and statistical association testing. There have been two widely applicable methods for phenotype generation: PHESANT (Millard et al., 2018) and PheWAS-R (Carroll et al., 2014). PHESANT creates phenotypes by extracting study-specific questionnaire and measurement data alongside linked hospital records in UK Biobank. PheWAS-R combines related International Classification of Diseases versions 9 and 10 (ICD-9/ICD-10) codes into clinically relevant groups termed phecodes. Both tools provide regression analysis for per-variant PheWAS of the generated phenotypes using generalized linear models in R, and both produce Manhattan plots. Online PheWAS resources such as Open Targets Genetics (Ghoussaini et al., 2021) do not perform new statistical tests. Instead, they are repositories for existing results from phenotypes generated by one of the two tools above or by individual genome-wide association studies (GWAS).
These tools are useful, but have several key gaps: 1. The phenotypes generated rely on a single data field or coding ontology and do not take advantage of all available data, such as primary care data; 2. Existing approaches do not provide tools for developing new phenotypes; 3. For running per-variant PheWAS, running each regression model in R is computationally inefficient and can result in inflated type I error for low-frequency variants with a case-control imbalance (Ma et al., 2013); 4. Online resources such as Open Targets Genetics have limited flexibility. For example, they accept only single nucleotide polymorphisms (SNPs) and retrieve results only for genetic models tested. The user cannot specify when to use new data fields or updates to existing data fields (e.g. updated health records), and the user cannot specify their preferred statistical approach and outputs, such as false discovery rate (FDR).
The platform we have developed, DeepPheWAS, addresses both phenotype generation and efficient association testing while incorporating the following developments that are not yet available in any single current platform or online resource: i. Clinically curated composite phenotypes for selected health conditions that integrate different data types (including primary and secondary care data) to study phenotypes not well captured by current classification trees; ii. Integration of quantitative phenotypes from primary care data, such as pathology records and clinical measures; iii. Integration of disease progression phenotypes, longitudinal trajectories of quantitative measures and drug response measures; iv. Clinically curated phenotype selection for traits that are extremely highly correlated; v. Efficient association testing and type I error control using PLINK 2 Firth fallback regression; vi. Flexible tests of additive, dominant, recessive and genotypic models; vii. Inclusion of complex variants, such as copy number variants with a wide range of copy numbers (multiallelic CNVs); viii. Ability to test genetic risk scores; ix. Creation of phenotypes in sex-specific strata to run a sex-stratified PheWAS; x. Tools for generating novel phenotypes using a simple phenotype mapping process.
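To make the genetic models in item vi concrete, the following minimal Python sketch shows one common textbook way to code a biallelic genotype as regression predictors under each model. It is an illustration only: the function name `encode_genotype` is hypothetical, it is not part of DeepPheWAS, and PLINK 2 may parameterise the genotypic model differently (for example, as an additive term plus a dominance deviation term).

```python
from typing import List


def encode_genotype(alt_count: int, model: str) -> List[int]:
    """Return regression predictor(s) for one biallelic genotype.

    alt_count is the number of copies of the effect allele (0, 1 or 2).
    This is a generic illustrative coding, not the DeepPheWAS or PLINK API.
    """
    if alt_count not in (0, 1, 2):
        raise ValueError("alt_count must be 0, 1 or 2")
    if model == "additive":
        # Effect scales linearly with the number of effect alleles.
        return [alt_count]
    if model == "dominant":
        # One or two copies are treated as equivalent.
        return [1 if alt_count >= 1 else 0]
    if model == "recessive":
        # Only two copies carry the effect.
        return [1 if alt_count == 2 else 0]
    if model == "genotypic":
        # Two indicators (heterozygote, homozygote), a 2-degree-of-freedom test.
        return [1 if alt_count == 1 else 0, 1 if alt_count == 2 else 0]
    raise ValueError(f"unknown model: {model}")


if __name__ == "__main__":
    for model in ("additive", "dominant", "recessive", "genotypic"):
        print(model, [encode_genotype(g, model) for g in (0, 1, 2)])
```

The additive, dominant and recessive codings each yield a single predictor, whereas the genotypic coding yields two and therefore makes no assumption about how the heterozygote relates to the two homozygotes.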
Application of DeepPheWAS to UK Biobank

Analysis of quantitative phenotypes
Our package can be applied to quantitative phenotypes derived from numerous data sources, including primary care data. For example, we created a phenotype using recorded levels of blood sodium in primary care records that is not yet included in any PheWAS platform. We applied DeepPheWAS to rs7193778 (nearest genes NFAT5 and TERF2), previously associated with urate levels (Köttgen et al., 2013). Our PheWAS shows various associations that are not currently documented in the GWAS Catalog, most strongly with blood sodium levels (Supplementary Fig. S1, Supplementary Table S1).

Highly correlated traits
We applied our DeepPheWAS approach to rs2912062 (nearest genes ANGPT2 and AGPAT5), shown to be associated with carotid intima-media thickness (IMT) (Strawbridge et al., 2020), a phenotype not currently available in any PheWAS platform. By selecting a single representative measure taken from many individual measurements, DeepPheWAS can collapse highly correlated quantitative traits into single measures (in this case carotid IMT maximum and carotid IMT mean), reducing redundancy and improving power. We recapitulated known GWAS findings (Supplementary Fig. S2, Supplementary Table S2).
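As a rough illustration of how candidate traits for collapsing might be flagged before clinical curation, the Python sketch below lists pairs of quantitative traits whose absolute Pearson correlation exceeds a threshold. The 0.9 threshold, the trait names and the toy values are all assumptions for illustration; the source describes the final selection as clinically curated and does not specify a correlation cut-off.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(
    traits: Dict[str, List[float]], threshold: float = 0.9
) -> List[Tuple[str, str, float]]:
    """Return (trait_a, trait_b, r) for pairs with |r| >= threshold."""
    pairs = []
    for (name_a, x), (name_b, y) in combinations(sorted(traits.items()), 2):
        r = pearson(x, y)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


if __name__ == "__main__":
    # Toy values for five hypothetical participants (not real data).
    toy = {
        "imt_mean": [0.60, 0.70, 0.65, 0.80, 0.75],
        "imt_max": [0.70, 0.82, 0.74, 0.95, 0.86],
        "height": [170, 165, 180, 160, 175],
    }
    for a, b, r in highly_correlated_pairs(toy):
        print(f"{a} ~ {b}: r = {r:.2f}")
```

Pairs flagged this way would still need expert review to decide which measure, if any, should represent the group.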

Association tests for complex structural variation
Human genomic variation includes variants that have more categories than SNPs. For example, the diploid human copy number of CCL3L1 ranges from 0 to 8 in UK Biobank participants (Fawcett et al., 2022) (Supplementary Fig. S5). In such situations, association testing may be based on the measured copy number or on user-specified collapsed categories, requiring a flexible platform. We used DeepPheWAS to test association with CCL3L1 copy number (coded 0-8) under a linear additive model; no associations reached an FDR threshold of 1% (Supplementary Fig. S6; the top five associations are shown in Supplementary Table S5). This recapitulates findings from earlier studies (Adewoye et al., 2018; Carpenter et al., 2011; Field et al., 2009; Urban et al., 2009).
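The FDR threshold used above can be illustrated with a short Python sketch of the Benjamini-Hochberg procedure. This is an assumption for illustration: the source does not state which FDR method DeepPheWAS applies, Benjamini-Hochberg is simply a common choice, and the p-values below are invented.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Invented p-values for four hypothetical phenotypes.
    p = [0.01, 0.04, 0.03, 0.005]
    q = benjamini_hochberg(p)
    significant = [i for i, value in enumerate(q) if value < 0.01]
    print(q, significant)
```

With an FDR threshold of 1%, a phenotype would be reported only if its adjusted p-value fell below 0.01; in this toy example none does.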

Genetic risk scores, composite and disease-progression phenotypes
Genetic risk scores (GRS) aggregate multiple SNPs, providing improved power for studying phenotypic associations, but cannot be specified in online PheWAS platforms. We performed a PheWAS using a 279-variant GRS for FEV1/FVC (Shrine, 2019), which showed association (FDR < 1%) with 47 traits, including increased risk of clinical COPD and clinical asthma with a higher score of FEV1/FVC-reducing alleles (Supplementary Fig. S7, Supplementary Table S6). Furthermore, the composite phenotypes generated by the DeepPheWAS platform (e.g. P2020 Asthma and P2054 COPD) were consistently more strongly associated with the GRS for FEV1/FVC than the relevant phecodes alone. We also show significant association with the novel disease-progression phenotypes exacerbation of COPD and age-of-onset of COPD, both of which are unavailable in existing PheWAS resources and have published GWAS results.
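A weighted GRS of the kind described here is conventionally the sum, over variants, of the per-allele weight multiplied by the individual's dosage of the effect allele. The Python sketch below shows that calculation under stated assumptions: the variant identifiers, weights and dosages are hypothetical, and the function is not the DeepPheWAS implementation.

```python
from typing import Dict


def genetic_risk_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of effect-allele dosages for one individual.

    dosages: effect-allele dosage per variant (0 to 2).
    weights: per-allele effect size per variant.
    """
    missing = set(weights) - set(dosages)
    if missing:
        raise KeyError(f"missing dosages for: {sorted(missing)}")
    return sum(weights[v] * dosages[v] for v in weights)


if __name__ == "__main__":
    # Hypothetical weights and dosages for three variants.
    weights = {"var1": 0.10, "var2": 0.05, "var3": 0.20}
    person = {"var1": 2, "var2": 1, "var3": 0}
    print(round(genetic_risk_score(person, weights), 4))
```

The resulting score can then be used as a single quantitative predictor in each per-phenotype regression, in the same way as a genotype dosage.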

Implementation
DeepPheWAS is an R package that can be run on high-performance computing clusters and requires R 4.1.0 and PLINK 2.0. DeepPheWAS is optimized for UK Biobank data and is expected to be interoperable with the UK Biobank Research Analysis Platform. Further details on the required data can be found in the supplement and at https://richard-packer.github.io/DeepPheWAS_site/.

Availability
The DeepPheWAS R package is freely available under the GNU General Public License v3.0 at https://github.com/Richard-Packer/DeepPheWAS.

Conclusion
Here, we present DeepPheWAS, an R package that facilitates phenome-wide association studies while addressing several limitations of existing approaches. Its advantages include the ability to analyse a broader range of phenotypes derived from large-scale electronic healthcare records, more informative composite phenotypes, greater flexibility in the type of genetic variation that can be studied and the ability to assess associations with genetic risk scores.