Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics

Abstract Motivation Measuring genetic diversity is an important problem because increasing genetic diversity is a key to making new genetic discoveries, while also being a major source of confounding to be aware of in genetics studies. Results Using the UK Biobank data, a prospective cohort study with deep genetic and phenotypic data collected on almost 500 000 individuals from across the UK, we carefully define 21 distinct ancestry groups from all four corners of the world. These ancestry groups can serve as a global reference of worldwide populations, with a handful of applications. Here, we develop a method that uses allele frequencies and principal components derived from these ancestry groups to effectively measure ancestry proportions from allele frequencies of any genetic dataset. Availability and implementation This method is implemented in function snp_ancestry_summary of R package bigsnpr. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Several projects have focused on providing genetic data from diverse populations, such as the HapMap project, the 1000 genomes project (1KG), the Simons genome diversity project and the human genome diversity project (1000Genomes Project Consortium et al., 2015Bergströ m et al., 2020;International HapMap 3 Consortium et al., 2010;Mallick et al., 2016). However, these datasets do not contain many individuals per population and therefore are not large enough for some purposes, such as accurately estimating allele frequencies for diverse worldwide populations. The UK Biobank (UKBB) project is a prospective cohort study with deep genetic and phenotypic data collected on almost 500 000 individuals from across the UK. Despite being a cohort from the UK, this dataset is so large that it includes individuals that were born in all four corners of the world. Therefore, the UKBB can serve as a global reference of worldwide populations when used in its entirety, i.e. without discarding valuable multiancestry genetic data.

Implementation
Here, we carefully use information on self-reported ancestry, country of birth and genetic similarity to define 21 distinct ancestry groups from the UKBB to be used as global reference populations, which is the first innovation of this paper. These include nine groups with genetic ancestries from Europe, four from Africa, three from South Asia, three from East Asia, one from the Middle East and one from South America (which are later merged into 18 groups in Table 1). The detailed procedure used to construct these reference ancestry groups is presented in the Supplementary Materials. As a direct application of these groups, we propose a new method to estimate global ancestry proportions from a cohort based on its allele frequencies only (i.e. summary statistics). Arriaga-MacKenzie et al.
(2021) previously proposed method Summix, which finds the convex combination of ancestry proportions a k (positive and sum to 1) which minimizes the following problem: where M is the number of variants, K the number of reference populations, f Here, we provide reference allele frequencies for 5 816 590 genetic variants across 21 diverse ancestry groups (which are later merged into 18 groups in Table 1). Moreover, we rely on the projection of our reference allele frequencies onto the PCA (principal component analysis) space computed from the corresponding UKBB  Note: Note that, because they are very close ancestry groups, we merge a posteriori the ancestry coefficients a k from 'Ireland', 'United Kingdom' and 'Scandinavia' into a single 'Europe (North West)' group, and similarly for 'Europe (North East)' and 'Europe (South East)' into a single 'Europe (East)' group. Citations for the allele frequencies ; with the same convex constraints on ancestry proportions a k , and where L is the number of PCs, p ðkÞ l is the projection of allele frequencies from population k onto PC l and p ð0Þ l is the (corrected) projection of allele frequencies from the cohort of interest onto PC l. Note that we need to correct for the shrinkage when projecting a new dataset (here the allele frequencies from the GWAS summary statistics) onto the PC space (Priv e et al., 2020). Finding the ancestry proportions in the PCA space (rather than using the allele frequencies directly) provides more power to distinguish between close populations, which is the second innovation of this paper. This enables us to use more reference populations in order to get a more fine-grained measure of genetic diversity.
The steps required by the proposed method are then 1/read all summary statistics datasets into R, i.e. the reference allele frequencies and corresponding PC loadings we provide for download as well as the GWAS summary statistics containing the allele frequencies of interest; 2/match variants and alleles between summary statistics and the reference allele frequencies we provide; 3/project allele frequencies onto the PCA space (matrix multiplication); 4/solve the final (small) quadratic programming problem, by relying on R package quadprog (Turlach et al., 2019). Steps 3 and 4 are now implemented in function snp_ancestry_summary in our R package bigsnpr (Priv e et al., 2018).
Step 2 can be performed using existing function snp_match. A tutorial is provided at https://privefl. github.io/bigsnpr/articles/ancestry.html. All these steps are very fast and overall require a few minutes only for GWAS summary statistics with millions of variants.

Results
We download several genome-wide association study (GWAS) summary statistics for which allele frequencies are reported and apply this new method to them. We first apply function snp_ances-try_summary to more homogeneous samples as an empirical validation; when applying it to the Biobank Japan (BBJ; Japanese cohort), FinnGen (Finnish), a Peruvian cohort, a Qatari cohort and Sub-Saharan African cohort, the ancestry proportions obtained match expectations (Table 1). When comparing our estimates with reported ancestries for more diverse cohorts, for example PAGE is composed of 44.6% Hispanic-Latinos, 34.7% African-Americans, 9.4% Asians, 7.9% Native Hawaiians and 3.4% of some other ancestries (self-reported), whereas our estimates are of 25.4% South American, 22.6% European (including 15.8% from South-West Europe), 34.1% African, 2.7% South Asian, 10.6% East Asian and 4.6% Filipino. GWAS summary statistics from either European ancestries or more diverse ancestries all have a substantial proportion estimated from European ancestry groups, while ancestries from other continents are still largely underrepresented (Table 1).
We then perform three secondary analyses. First, we compare the results obtained previously in Table 1 with the results we would get without using the PCA projection of allele frequencies (i.e. equivalent to the Summix method). The resulting ancestry proportions are presented in Supplementary Table S1 and are clearly less precise for BBJ and FinnGen. Second, we compare previous results with the ones obtained using a smaller number of variants, by randomly sampling 100 000 variants to run the proposed method. The resulting ancestry proportions are presented in Supplementary Table  S2 and are highly consistent with the ones from Table 1, showing that 100 000 overlapping variants are enough to run the proposed method. Third, we also infer ancestry proportions for all 345 individuals of the Simons genome diversity project (Mallick et al., 2016) using the reference allele frequencies we provide and two methods. We use either our proposed method with the genotypes of an individual divided by 2 in place of allele frequencies, or by using the projection analysis of ADMIXTURE (-P, Shringarpure et al., 2016). Results are very consistent between the two methods, and are overall as expected, further validating the proposed ancestry groups and the proposed method to infer ancestry proportions, which seems very precise even at the individual level.

Discussion
Here, we have identified an unprecedentedly large and diverse set of ancestry groups within a single cohort, the UKBB. Using allele frequencies and PCs derived from these ancestry groups, we show how to effectively measure diversity from GWAS summary statistics reporting allele frequencies. Measuring genetic diversity is an important problem because increasing genetic diversity is key to making new genetic discoveries, while also being a major source of confounding to be aware of in genetics studies. Our work has limitations though. First, it is unknown whether we can effectively capture any existing ancestry as a combination of the 21 reference populations we defined. For example, it seems that Native Hawaiians in the PAGE study are partly captured by the "Philippines" ancestry group we define. Second, with the 21 ancestry groups we define, we probably capture a large proportion of the genetic diversity in Europe, but more fine-grained diversity in other continents may still be lacking. Third, when using the allele frequencies reported in the GWAS summary statistics, it is not clear whether they were computed from all individuals (i.e. before performing any quality control and filtering), and, for meta-analyses of binary traits, whether they were computed as a weighted average of total or effective sample sizes. Despite these limitations, we envision that the ancestry groups we define here will have many useful applications. The presented method that uses these groups could e.g. be used to automatically report ancestry proportions in the GWAS Catalog (MacArthur et al., 2017). These ancestry groups could also be used for assigning ancestry in other cohorts using the PC projection from this study (Priv e et al., 2022), assessing the portability of polygenic scores (Priv e et al., 2022) or deriving linkage disequilibrium references matching GWAS summary statistics from diverse ancestries.

Software and code availability
The newest version of R package bigsnpr can be installed from GitHub (see https://github.com/privefl/bigsnpr) and a recent enough version can be installed from CRAN. A tutorial on ancestry proportions and ancestry grouping is available at https://privefl.github.io/ bigsnpr/articles/ancestry.html. The set of reference allele frequencies for 5 816 590 genetic variants across 21 diverse ancestry groups defined here can be downloaded at https://figshare.com/ndown loader/files/31620968 and PC loadings for all variants across 16 PCs at https://figshare.com/ndownloader/files/31620953. All codes used for this paper are available at https://github.com/privefl/freq-an cestry/tree/main/code. We have extensively used R packages bigstatsr and bigsnpr (Priv e et al., 2018) for analyzing large genetic data, packages from the future framework (Bengtsson, 2021) for easy scheduling and parallelization of analyses on the high-performance computing cluster and packages from the tidyverse suite (Wickham et al., 2019) for shaping and visualizing results.