Abstract

The UK Biobank is a prospective study of more than 500 000 participants, which has aggregated data from questionnaires, physical measures, biomarkers, imaging and follow-up for a wide range of health-related outcomes, together with genome-wide genotyping supplemented with high-density imputation. Previous studies have highlighted fine-scale population structure in the UK on a North-West to South-East cline, but the impact of unmeasured geographical confounding on genome-wide association studies (GWAS) of complex human traits in the UK Biobank has not been investigated. We considered 368 325 white British individuals from the UK Biobank and performed GWAS of their birth location. We demonstrate that widely used approaches to adjust for population structure, including principal component analysis and mixed modelling with a random effect for a genetic relationship matrix, cannot fully account for the fine-scale geographical confounding in the UK Biobank. We observe significant genetic correlation of birth location with a range of lifestyle-related traits, including body-mass index and fat mass, hypertension and lung function, even after adjustment for population structure. Variants driving associations with birth location are also strongly associated with many of these lifestyle-related traits after correction for population structure, indicating that there could be environmental factors that are confounded with geography that have not been adequately accounted for. Our findings highlight the need for caution in the interpretation of lifestyle-related trait GWAS in UK Biobank, particularly in loci demonstrating strong residual association with birth location.

Introduction

The United Kingdom (UK) is located off the north-western coast of the European mainland and incorporates Great Britain, Northern Ireland and many smaller islands (including the Hebrides, Shetlands and Orkneys). Previous studies have highlighted that population structure within the UK is rather limited, but that the structure that does exist is fine-scale and lies on North-South and East-West clines (1,2). Analyses undertaken using genome-wide genotyping data from the People of the British Isles collection identified genetic clusters that are highly localized, separating the Orkney Islands, Scotland and Northern England, Central and Southern England and Wales (3). If not adequately accounted for in the analysis, such fine-scale structure can lead to false positive signals in genome-wide association studies (GWAS) of traits with characteristics that vary between regions (4).

Multivariate statistical techniques, such as principal component analysis (PCA), have been widely used in population genetics to visualize genotype differences between individuals in a few dimensions via eigenvalue decomposition of a genetic relationship matrix (GRM). Axes of genetic variation, derived from PCA, can be used to adjust for population structure by their inclusion as covariates in a generalized linear regression model (5). An alternative, widely used approach to account for population structure is to adjust for the genetic correlation between individuals, as measured by the GRM, which can be included as a random effect in a generalized linear mixed model (6–13). However, the ability of these approaches to adequately account for unmeasured confounding due to fine-scale structure in large, population-based samples has not been evaluated.
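As an illustration of the PCA approach described above, the following minimal sketch builds a GRM from a matrix of genotype dosages and extracts its leading eigenvectors as axes of genetic variation. The function name `grm_principal_components`, the allele-frequency standardization and the simulated two-population data are assumptions made for this example; they are not taken from the UK Biobank pipeline.

```python
import numpy as np

def grm_principal_components(dosages, n_components=10):
    """Return the GRM and its leading eigenvectors and eigenvalues.

    dosages: individuals x variants array of allele counts (0, 1 or 2).
    """
    G = np.asarray(dosages, dtype=float)
    freq = G.mean(axis=0) / 2.0                    # allele frequency per variant
    polymorphic = (freq > 0.0) & (freq < 1.0)      # monomorphic variants carry no information
    G, freq = G[:, polymorphic], freq[polymorphic]
    # Centre and scale each variant so that all variants contribute equally.
    Z = (G - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    grm = Z @ Z.T / Z.shape[1]
    eigvals, eigvecs = np.linalg.eigh(grm)         # eigh returns ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:n_components]
    return grm, eigvecs[:, top], eigvals[top]

# Simulated example (hypothetical data): two populations with drifted allele frequencies.
rng = np.random.default_rng(0)
n_per_pop, n_variants = 100, 2000
freq_a = rng.uniform(0.1, 0.9, n_variants)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, n_variants), 0.05, 0.95)
dosages = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_pop, n_variants)),
    rng.binomial(2, freq_b, size=(n_per_pop, n_variants)),
])
grm, axes, eigvals = grm_principal_components(dosages, n_components=2)
pc1 = axes[:, 0]
print("mean PC1, population A:", pc1[:n_per_pop].mean())
print("mean PC1, population B:", pc1[n_per_pop:].mean())
```

In this toy setting the first axis separates the two simulated populations. The columns of `axes` play the role of the covariates that would be added to the regression model.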

The UK Biobank is a very large and detailed prospective study of more than 500 000 participants aged 40–69 years when recruited between 2006 and 2010 (14). The study has aggregated (and continues to collect) extensive information from participants, including data from questionnaires, physical measures, biomarkers, imaging and follow-up for a wide range of health-related outcomes (including linkage to primary care and disease-specific registers). Genome-wide genotyping data, typed on the Affymetrix UK Biobank or BiLEVE arrays, have been centrally called and quality control assessed by the UK Biobank Analysis Team (15), and imputed up to reference panels from the 1000 Genomes Project (16), UK10K Project (17) and Haplotype Reference Consortium (18). PCA was also centrally performed by the UK Biobank Analysis Team to generate axes of genetic variation that can be used to identify participants of similar ancestry and to control for population structure (15).

In this investigation, we first assess the extent of fine-scale population structure in a subset of unrelated white British participants from the UK Biobank using demographic data of reported birth location. We then evaluate the impact of population structure on GWAS of complex human traits in the UK Biobank by considering genetic correlation with birth location and inflation in genome-wide association summary statistics. Finally, we consider locus-specific impact of residual confounding of birth location with complex human traits and demonstrate the effect on association signals of alternative approaches to account for population structure.

Figure 1

Miami plot and quantile–quantile plots for association with Northing and Easting Cartesian coordinates for birth location of unrelated white British individuals from the UK Biobank after correction for population structure. Association analyses were performed with inclusion of a random effect for the GRM in a linear mixed model. Inflation factors (λ) were assessed via the LD-score regression intercept. The genome-wide significance threshold (P < 5 × 10^−8) is indicated by the horizontal lines.

Results

Extent of population structure in the UK Biobank

To assess the extent of fine-scale population structure in the UK Biobank, we considered a subset of unrelated white British participants based on self-reported ethnicity and centrally derived axes of genetic variation (Materials and Methods, Supplementary Material, Fig. S1). We then interrogated demographic data of reported birth location, for which UK postcodes had been converted to Easting and Northing Cartesian coordinates, which we refer to as ‘Eastings’ and ‘Northings’, respectively (Supplementary Material, Fig. S2). We excluded individuals with missing birth location and those from the pilot study at the Stockport recruitment center, for which the Cartesian coordinates were incorrect. For the remaining 368 325 individuals, we then tested for association of Eastings and Northings with 8 806 946 well-imputed variants with minor allele frequency (MAF) > 0.5% in a linear regression model, including only genotyping array as a covariate, as implemented in SNPTESTv2.5.2 (19). To account for population structure, we then considered inclusion of (i) the first ten (or twenty) centrally derived axes of genetic variation from PCA as covariates, as implemented in SNPTESTv2.5.2 (19), or (ii) a random effect for the GRM, as implemented in BOLT-LMMv2.3 (13) (Materials and Methods, Supplementary Material, Fig. S3).
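To make the covariate-adjustment step concrete, here is a minimal sketch of a single-variant linear regression with and without an axis of genetic variation as a covariate. It is not SNPTEST: the function name `single_variant_test`, the normal approximation for the P-value and the simulated data are all assumptions made for illustration.

```python
import math
import numpy as np

def single_variant_test(phenotype, dosage, covariates=None):
    """Regress phenotype on dosage (plus intercept and optional covariates).

    Returns the allelic effect estimate, its standard error and a two-sided
    P-value from a normal approximation (reasonable for large samples).
    """
    y = np.asarray(phenotype, dtype=float)
    columns = [np.ones_like(y), np.asarray(dosage, dtype=float)]
    if covariates is not None:
        columns.append(np.asarray(covariates, dtype=float))
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])        # residual variance
    se = math.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])  # SE of the dosage effect
    z = coef[1] / se
    return float(coef[1]), se, math.erfc(abs(z) / math.sqrt(2.0))

# Simulated example (hypothetical data): one ancestry axis shifts both the
# allele frequency and the Northing, but the variant has no direct effect.
rng = np.random.default_rng(1)
n = 5000
axis = rng.normal(size=n)
dosage = rng.binomial(2, np.clip(0.3 + 0.1 * axis, 0.05, 0.95))
northing = 2.0 * axis + rng.normal(size=n)

beta_raw, se_raw, p_raw = single_variant_test(northing, dosage)
beta_adj, se_adj, p_adj = single_variant_test(northing, dosage, axis)
print(f"unadjusted: beta={beta_raw:.3f}, P={p_raw:.2e}")
print(f"adjusted:   beta={beta_adj:.3f}, P={p_adj:.2e}")
```

In the simulation the confounder is measured exactly, so adjustment removes the spurious signal. The concern examined in this study is the situation where the available axes capture the confounder only partially.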

As expected, there was substantial genome-wide inflation in the association with Northings and Eastings, assessed via the LD-score regression intercept (20), with no correction for population structure (λN = 7.817 and λE = 6.638). Substantial inflation was also observed after adjustment for ten axes of genetic variation as covariates (λN = 3.871 and λE = 1.912), which was not diminished by inclusion of an additional ten axes (Supplementary Material, Fig. S4). The inflation was reduced after inclusion of a random effect for the GRM, but considerable fine-scale population structure remained unaccounted for: λN = 1.651 and λE = 1.431 (Fig. 1). We observed no difference in inflation between directly genotyped (λN = 1.650 and λE = 1.436) and imputed variants (λN = 1.648 and λE = 1.428). For this mixed model analysis, we observed strong negative genetic correlation between Northings and Eastings from LD-score regression (21) (rG = −0.660, P = 2.1 × 10^−11), confirming previous reports of the North-West to South-East cline in UK population structure (1). The residual association with Northings was more pronounced than for Eastings (Fig. 1). A total of 74 loci attained genome-wide significant evidence of association (P < 5 × 10^−8) with Northings after inclusion of a random effect for the GRM (Table 1). The strongest association signals mapped to/near TLR10-TLR1 (rs4543123, pN = 5.3 × 10^−56, pE = 4.1 × 10^−11) and LCT (rs182549, pN = 1.7 × 10^−17, pE = 2.0 × 10^−12), both of which have been previously reported as confounded with UK population structure (Supplementary Material, Figs S5 and S6) (1). The toll-like receptor family of genes encodes proteins that play a key role in the innate immune system, such that population structure could have arisen through historical geographical differences in exposure to pathogens. The LCT gene encodes the lactase protein that allows lactose tolerance to persist into adulthood and has been subject to positive selection after the domestication of cattle across Europe (22).
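The inflation factors quoted above are LD-score regression intercepts. Under the LD-score regression model, the expected chi-square statistic of a variant increases linearly with its LD score, and the intercept captures inflation that is not explained by polygenicity. The sketch below fits that line by ordinary least squares on simulated data. It is a simplification: the published method uses a weighted regression with block-jackknife standard errors, and the function name `ld_score_intercept` and all simulated values are hypothetical.

```python
import numpy as np

def ld_score_intercept(chi2, ld_scores):
    """Unweighted least-squares fit of chi2 = intercept + slope * LD score."""
    ld = np.asarray(ld_scores, dtype=float)
    X = np.column_stack([np.ones_like(ld), ld])
    (intercept, slope), *_ = np.linalg.lstsq(X, np.asarray(chi2, dtype=float), rcond=None)
    return float(intercept), float(slope)

# Simulated example (hypothetical values): true intercept 1.6, polygenic slope 0.01.
rng = np.random.default_rng(2)
n_variants = 200_000
ld_scores = rng.gamma(shape=2.0, scale=50.0, size=n_variants)
expected_chi2 = 1.6 + 0.01 * ld_scores
# Each statistic is a scaled 1-df chi-square draw around its expectation.
chi2 = expected_chi2 * rng.chisquare(df=1, size=n_variants)

intercept, slope = ld_score_intercept(chi2, ld_scores)
print(f"estimated intercept = {intercept:.3f}, slope = {slope:.4f}")
```

An intercept close to 1 would indicate that the inflation in test statistics is consistent with polygenicity alone; values well above 1, as reported for Northings and Eastings, point to confounding.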

Impact of population structure on GWAS of complex human traits in the UK Biobank

We next sought to assess the impact of fine-scale UK population structure on GWAS of complex human traits in the UK Biobank. To do this, we first used LD-score regression (21) to assess the genome-wide genetic correlation between Northings and Eastings (after inclusion of a random effect for the GRM), and selected traits available in the UK Biobank. We utilized published association summary statistics available from LD-Hub (23), obtained from analysis of 337 199 unrelated white British individuals in a linear regression model with adjustment for the first ten centrally derived axes of genetic variation from PCA as covariates (Materials and Methods). Of the 597 traits reported in LD-Hub, we excluded those that were not directly related to health outcomes, lifestyle and/or anthropometric measures (such as current employment, diseases of family members, education and medication). For the remaining 268 traits, we observed significant correlation with Northings (P < 0.00019, Bonferroni correction) for 41 traits (Supplementary Material, Table S1), most of which were broadly related to lifestyle factors, even after adjustment for population structure. A more northerly (and westerly) birth location was genetically correlated with increased body mass index (BMI) and fat mass, alcohol consumption, hypertension and smoking, and with decreased lung function (Fig. 2), suggesting that association signals reported for these traits in UK Biobank could be partially driven by residual confounding with geography that has not been adequately accounted for in the analysis.
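The significance threshold quoted above follows from a Bonferroni correction over the 268 retained traits. The short sketch below reproduces the arithmetic, assuming a family-wise error rate of 0.05 (the rate is not stated explicitly in the text); the helper name `bonferroni_threshold` is hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

n_traits = 268  # traits retained after exclusions, as stated in the text
threshold = bonferroni_threshold(n_traits)
print(f"Bonferroni threshold: {threshold:.5f}")  # prints 0.00019
```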

Table 1

Loci attaining genome-wide significant association (P < 5 × 10^−8) with Northing Cartesian coordinates for birth location of unrelated white British individuals from the UK Biobank after correction for population structure through inclusion of a random effect for the GRM in a linear mixed model.

| Locus | Lead variant | Chr | Position (bp, b37) | Mixed model P-value (Northings) | Mixed model P-value (Eastings) |
|---|---|---|---|---|---|
| YTHDF2 | rs183909650 | 1 | 29 059 553 | 1.2 × 10^−8 | 0.75 |
| MYSM1-JUN | rs138938527 | 1 | 59 196 687 | 9.3 × 10^−10 | 0.35 |
| Intergenic | rs11184903 | 1 | 106 972 375 | 3.4 × 10^−8 | 0.011 |
| POLR3C | rs141333427 | 1 | 145 599 750 | 6.1 × 10^−10 | 0.55 |
| GJA5-GJA8 | rs76713613 | 1 | 147 307 666 | 4.1 × 10^−8 | 0.49 |
| FCRLB | rs6700369 | 1 | 161 691 586 | 2.6 × 10^−12 | 0.24 |
| KIAA0040 | rs2861158 | 1 | 175 135 829 | 9.4 × 10^−9 | 0.063 |
| CHRM3 | rs142495445 | 1 | 239 889 366 | 2.1 × 10^−10 | 0.023 |
| LPIN1 | rs869162 | 2 | 12 017 846 | 5.3 × 10^−9 | 0.82 |
| PRKCE-EPAS1 | rs72795609 | 2 | 46 458 369 | 1.3 × 10^−8 | 0.56 |
| LCT | rs182549 | 2 | 136 616 754 | 1.7 × 10^−17 | 2.0 × 10^−12 |
| PDE11A | rs75313639 | 2 | 178 613 409 | 1.9 × 10^−9 | 0.083 |
| STAT4 | rs17768109 | 2 | 191 920 448 | 3.2 × 10^−8 | 0.34 |
| Intergenic | rs138897148 | 3 | 30 414 016 | 2.4 × 10^−8 | 0.53 |
| GBE1-LINC00971 | rs75932529 | 3 | 82 986 685 | 1.2 × 10^−10 | 0.76 |
| Intergenic | rs189809665 | 3 | 95 199 900 | 1.3 × 10^−8 | 0.97 |
| Intergenic | rs191077151 | 3 | 102 612 554 | 5.6 × 10^−9 | 0.61 |
| ILDR1 | rs147965995 | 3 | 121 719 991 | 3.5 × 10^−8 | 0.78 |
| YEATS2 | rs166398 | 3 | 183 446 977 | 1.5 × 10^−8 | 0.53 |
| TLR10-TLR1 | rs4543123 | 4 | 38 792 524 | 5.3 × 10^−56 | 4.1 × 10^−11 |
| Intergenic | rs562248335 | 4 | 53 210 826 | 3.6 × 10^−8 | 0.55 |
| AASDH | rs10010544 | 4 | 57 202 676 | 3.7 × 10^−8 | 0.23 |
| PARM1-LINC02483 | rs142147881 | 4 | 76 126 259 | 6.9 × 10^−9 | 0.87 |
| SLC10A7-POU4F2 | rs138838211 | 4 | 147 525 948 | 2.7 × 10^−8 | 0.49 |
| LINC02100-RF00017 | rs144164550 | 5 | 18 838 724 | 2.7 × 10^−9 | 0.96 |
| Intergenic | rs11738948 | 5 | 44 999 799 | 1.7 × 10^−8 | 0.31 |
| PART1 | rs3887175 | 5 | 59 790 456 | 4.7 × 10^−8 | 0.050 |
| CSNK1G3 | rs2897789 | 5 | 122 948 316 | 8.6 × 10^−9 | 0.047 |
| SMIM33 | rs13181561 | 5 | 138 850 905 | 7.6 × 10^−9 | 0.00059 |
| RP11-541P9.3 | rs185543831 | 5 | 162 606 973 | 1.5 × 10^−10 | 0.29 |
| MHC region | rs67850286 | 6 | 32 207 912 | 2.9 × 10^−12 | 0.27 |
| ANKRD66-MEP1A | rs9463249 | 6 | 46 747 864 | 3.7 × 10^−8 | 0.60 |
| RN7SKP211 | rs77691922 | 6 | 106 389 862 | 8.1 × 10^−10 | 0.49 |
| LINC02534 | rs527638681 | 6 | 116 060 967 | 3.3 × 10^−8 | 0.0034 |
| ZC3H12D-PPIL4 | rs183211514 | 6 | 149 809 239 | 2.7 × 10^−9 | 0.14 |
| ZNF316 | rs9640029 | 7 | 6 685 123 | 4.5 × 10^−8 | 0.038 |
| THSD7A-TMEM106B | rs12699279 | 7 | 11 886 719 | 1.7 × 10^−8 | 0.27 |
| LOC401324 | rs7807834 | 7 | 35 355 874 | 4.2 × 10^−8 | 0.95 |
| GTF2IRD2 | rs145191771 | 7 | 74 285 390 | 2.5 × 10^−8 | 0.96 |
| AC002451.1-DYNC1I1 | rs73241153 | 7 | 95 321 530 | 3.2 × 10^−8 | 0.013 |
| KLRG2-CLEC2L | rs6467860 | 7 | 139 190 020 | 5.8 × 10^−9 | 0.38 |
| GIMAP4 | rs6969418 | 7 | 150 262 584 | 8.4 × 10^−9 | 0.97 |
| TUSC3 | rs12543949 | 8 | 15 309 705 | 3.2 × 10^−8 | 0.55 |
| FGF20 | rs2467176 | 8 | 16 692 687 | 4.9 × 10^−8 | 0.078 |
| LY96 | rs11466004 | 8 | 74 941 275 | 3.3 × 10^−8 | 0.37 |
| JRK-PSCA | rs2920288 | 8 | 143 753 289 | 3.7 × 10^−9 | 0.92 |
| KDM4C | rs140546025 | 9 | 6 765 320 | 3.9 × 10^−8 | 0.71 |
| Intergenic | rs72712132 | 9 | 12 297 698 | 4.7 × 10^−8 | 0.57 |
| TLR4 | rs4986790 | 9 | 120 475 302 | 5.8 × 10^−12 | 0.014 |
| PIK3AP1 | rs12572544 | 10 | 98 509 591 | 1.8 × 10^−9 | 0.57 |
| TMEM180 | rs74908306 | 10 | 104 233 229 | 9.6 × 10^−11 | 0.055 |
| NADSYN1-KRTAP5–7 | rs11234014 | 11 | 71 232 811 | 7.2 × 10^−10 | 1.1 × 10^−7 |
| C11orf53 | rs7934982 | 11 | 111 149 632 | 4.6 × 10^−9 | 0.078 |
| CSRP2 | rs10746288 | 12 | 77 261 098 | 4.6 × 10^−11 | 0.37 |
| MYBPC1 | rs10860766 | 12 | 102 064 667 | 4.3 × 10^−8 | 0.12 |
| GALNT9 | rs117340324 | 12 | 132 683 244 | 2.4 × 10^−8 | 0.25 |
| LINC00417-ANKRD20A9P | rs9552508 | 13 | 19 354 675 | 2.6 × 10^−8 | 0.65 |
| FLT3 | rs35263155 | 13 | 28 652 999 | 4.5 × 10^−8 | 0.76 |
| LINC00398-LINC00545 | rs73165012 | 13 | 31 388 774 | 2.9 × 10^−8 | 0.60 |
| DCAF5 | rs143797681 | 14 | 69 498 428 | 3.1 × 10^−8 | 0.69 |
| PWRN2 | rs544128806 | 15 | 24 494 412 | 4.7 × 10^−8 | 0.26 |
| SECISBP2L-COPS2 | rs62009762 | 15 | 49 389 757 | 4.5 × 10^−8 | 0.0053 |
| PRTG-NEDD4 | rs150276168 | 15 | 56 067 643 | 1.8 × 10^−8 | 0.34 |
| LINC00923 | rs72752662 | 15 | 98 370 408 | 1.6 × 10^−10 | 0.25 |
| IFT140 | rs117492052 | 16 | 1 655 759 | 6.7 × 10^−9 | 0.19 |
| MC1R | rs1805007 | 16 | 89 986 117 | 6.8 × 10^−9 | 0.46 |
| LINC00670 | rs149081560 | 17 | 12 503 649 | 1.5 × 10^−8 | 0.42 |
| ZNF536 | rs149713626 | 19 | 30 817 216 | 1.9 × 10^−8 | 0.37 |
| SELENOV | rs8102247 | 19 | 40 008 118 | 2.1 × 10^−9 | 0.46 |
| LTBP4-NUMBL | rs2604861 | 19 | 41 150 922 | 9.4 × 10^−9 | 0.0026 |
| VSTM2L | rs6013469 | 20 | 36 558 660 | 9.7 × 10^−9 | 0.58 |
| MAFB | rs6102086 | 20 | 39 281 690 | 2.6 × 10^−8 | 0.0046 |
| LINC01549 | rs193267476 | 21 | 18 710 258 | 4.7 × 10^−8 | 0.31 |
| RUNX1 | rs564634064 | 21 | 36 479 812 | 2.4 × 10^−8 | 0.18 |

Figure 2

Genetic correlation of lifestyle-related traits with Northing and Easting Cartesian coordinates for birth location of unrelated white British individuals from the UK Biobank. We selected twelve traits as representative of obesity and fat distribution, lung function and smoking, and blood pressure and hypertension. The tips of the arrows correspond to the genetic correlation of the trait with Northings and Eastings. A more northerly (and westerly) birth location was genetically correlated with increased body-mass index and fat mass, hypertension and smoking, and with decreased lung function.

To further investigate the consequences of this residual confounding, we considered BMI and forced vital capacity (FVC, a measure of lung function) as representative of lifestyle-related traits that are genetically correlated with birth location (Materials and Methods). For both traits, the LD-score regression intercepts obtained from 368 325 unrelated white British individuals after inclusion of a random effect for the GRM in the linear regression model indicated evidence of residual population structure that has not been accounted for in the analysis: λBMI = 1.155 and λFVC = 1.099. In contrast, when we considered asthma, a disease that is characterized by poor lung function, but that did not demonstrate significant genetic correlation with birth location (P = 0.70 for Northings), the impact of residual population structure was much less pronounced: λASTHMA = 1.059.

Previous studies have highlighted that genome-wide inflation in GWAS of complex human traits after inclusion of a random effect for the GRM in the linear regression model can reflect environmental factors that are confounded with geography (24), which can be better controlled for through adjustment for axes of genetic variation from PCA (25). We hypothesized that we could account for this residual confounding of BMI and FVC by adjusting for ten axes of genetic variation, Northings and Eastings as covariates in the linear mixed model, in addition to a random effect for the GRM (Materials and Methods). We found that these additional adjustments only marginally reduced the LD-score regression intercept for both traits: λBMI = 1.140 and λFVC = 1.095 (Fig. 3). The same adjustments had a negligible impact on the LD-score regression intercept for asthma: λASTHMA = 1.057. Genome-wide, adjustment for Northings and Eastings as covariates in the linear regression model, in addition to the ten axes of genetic variation, did not have a major impact on allelic effect estimates and association P-values (Supplementary Material, Fig. S7).

Figure 3

LD-score regression intercepts for BMI, FVC and asthma, obtained for unrelated white British individuals from the UK Biobank after correction for population structure through inclusion of a random effect for the GRM in a linear mixed model, with and without adjustment for ten axes of genetic variation and Northing and Easting Cartesian coordinates. The height of each bar represents the LD-score intercept, and the error bars define the 95% confidence interval.

We also investigated the possibility that current residence would better reflect ongoing exposure to environmental factors that are confounded with geography than would birth location. We repeated our analyses of BMI, FVC and asthma, after adjustment for Northings and Eastings derived from current residence postcode, but this did not substantially reduce the genome-wide inflation, compared to birth location, for any of these traits: λBMI = 1.150, λFVC = 1.096 and λASTHMA = 1.054.

Locus-specific impact of residual confounding of birth location with lifestyle-related traits in the UK Biobank

We next investigated the locus-specific impact of residual confounding of birth location with the 41 (mostly lifestyle-related) traits that were genetically correlated with Northings. To do this, we considered the 74 loci attaining genome-wide significant evidence of association (P < 5 × 10^−8) with Northings after inclusion of a random effect for the GRM. We first dissected association signals for Northings at each locus through approximate conditional analyses implemented in GCTA (26), making use of 5000 randomly selected white British individuals from UK Biobank as a reference for linkage disequilibrium. We identified 115 distinct association signals attaining locus-wide significance (P < 10^−5) for Northings, including six mapping to the major histocompatibility complex (MHC) (Supplementary Material, Table S2). Index variants for 59 (51.3%) of these signals were of low frequency (MAF < 5%), which would be expected to have arisen due to more recent mutation events and hence be more likely to be confounded with geography (Supplementary Material, Fig. S8).

For each distinct association signal, we then identified ‘high-confidence’ variants accounting for at least 5% of the posterior probability of driving confounding with Northings (Materials and Methods). We interrogated each high-confidence variant for association with the 41 traits demonstrating significant genetic correlation with Northings in the UK Biobank. We utilized published association summary statistics available from PhenoScanner (27,28), obtained from analysis of 337 199 unrelated white British individuals in a linear regression model with adjustment for the first ten centrally derived axes of genetic variation from PCA as covariates (Materials and Methods). High-confidence variants driving distinct signals for Northings at five loci were associated, at genome-wide significance, with at least one trait (Supplementary Material, Table S3).
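The 5% posterior-probability filter can be illustrated with a small sketch. The Materials and Methods are not reproduced in this section, so the calculation below is only one common approach and is an assumption: it supposes a single driving variant per signal, equal prior probabilities, and a Bayes factor for each variant proportional to exp(χ²/2). The function names, variant identifiers and chi-square values are hypothetical.

```python
import numpy as np

def posterior_probabilities(chi2):
    """Posterior probability per variant, assuming Bayes factor ∝ exp(chi2 / 2)."""
    log_bf = np.asarray(chi2, dtype=float) / 2.0
    log_bf = log_bf - log_bf.max()        # subtract the maximum for numerical stability
    weights = np.exp(log_bf)
    return weights / weights.sum()

def high_confidence_variants(variant_ids, chi2, threshold=0.05):
    """Variants whose posterior probability is at least `threshold`."""
    posterior = posterior_probabilities(chi2)
    return [(v, float(p)) for v, p in zip(variant_ids, posterior) if p >= threshold]

# Hypothetical signal with four variants and their association chi-square statistics.
ids = ["v1", "v2", "v3", "v4"]
chi2 = [40.0, 38.0, 30.0, 10.0]
for variant, prob in high_confidence_variants(ids, chi2):
    print(f"{variant}: posterior probability {prob:.3f}")
```

With these made-up statistics, only the first two variants exceed the 5% threshold; the remaining variants carry a negligible share of the posterior probability.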

At the LCT locus, two high-confidence variants (rs182549 and rs309137, together accounting for 66.5% of the posterior probability of driving the confounding with Northings) were associated (at genome-wide significance) with 16 of the 41 traits that were genetically correlated with birth location. The Northing increasing alleles at the two variants were associated with increased BMI and multiple measures of fat mass, and with decreased lung function (FVC and forced expiratory volume in 1 second), which is concordant with the direction of the genetic correlation with birth location. Adjustment for ten axes of genetic variation, Northings and Eastings as covariates in the linear mixed model, in addition to a random effect for the GRM, reduced the strength of association with these traits by an order of magnitude across the locus, reflecting correction for residual confounding with birth location (Supplementary Material, Fig. S9). There was a more noticeable impact on the association with BMI, where the estimated allelic effect of rs182549 decreased four-fold after adjustment (Supplementary Material, Table S4). These results indicate the potential bias in allelic effect estimates on complex traits that could arise with inadequate correction for population structure in UK Biobank.

At the MHC, where population structure reflects strong selective pressure of infectious diseases in recent human history (22), one high-confidence variant (rs9268556, 13.2% posterior probability of driving the confounding with Northings) was associated (at genome-wide significance) with FVC. In contrast to the signal at the LCT locus, the Northing increasing allele was associated with increased FVC, which is discordant with the direction of the genetic correlation with birth location. Consequently, adjustment for ten axes of genetic variation, Northings and Eastings as covariates in the linear mixed model, in addition to a random effect for the GRM, did not noticeably reduce the strength of association with lung function at this locus (Supplementary Material, Table S4).

Discussion

We have demonstrated that fine-scale population structure in the UK Biobank cannot be fully accounted for through adjustment for centrally derived axes of genetic variation or inclusion of a random effect for the GRM. There was substantial inflation in genome-wide association with Northing and Easting Cartesian coordinates that were derived from birth location, even after inclusion of a random effect for the GRM in the linear regression model. The inflation was greater for Northings than for Eastings, which may reflect greater variation in latitude than longitude for participants in the UK Biobank. Investigations previously undertaken with GWAS data from the People of the British Isles collection indicated that major clusters separate from North to South, which could reflect major historical events in the peopling of the British Isles (3). These results are consistent with observations across the wider European continent, where the first axis of genetic variation, which correlates with North-South geography, explains more variability in allele frequencies than the second axis, which correlates with East-West geography (29). Bivariate analysis of Northings and Eastings, taking account of the correlation between longitude/latitude of birth location, might provide additional insight into population structure. However, further methodological development and software is required to implement bivariate linear mixed models that can accommodate the scale of GWAS in the UK Biobank.

After correction for population structure, we have observed significant genetic correlation of Northings with 41 traits, most of which are related to lifestyle, including BMI and fat mass, alcohol consumption, hypertension, and smoking and lung function. LD-score regression intercepts for two exemplar lifestyle-related traits, BMI and FVC, indicated evidence of residual population structure that has not been accounted for by the inclusion of a random effect for the GRM in the linear regression model. Such inflation could reflect environmental factors that are confounded with geography, such as diet and smoking habits, which can better be controlled for through adjustment for axes of genetic variation. However, adjustment for ten axes of genetic variation, in addition to Eastings and Northings derived from birth location or current residence, did not substantially reduce the inflation. These results suggest that simple modelling of birth location (or current residence) and/or axes of genetic variation does not capture the full extent of geographical confounding with these environmental influences on lifestyle-related traits. More complex models, for example that allow for non-linear relationships with geography, may offer improved control for confounding with environmental risk factors, but cannot be easily accommodated in computationally efficient software that can be applied to the scale of GWAS in the UK Biobank.

We identified 74 loci that demonstrated significant residual association with Northings after inclusion of a random effect for the GRM in the linear regression model, which map to/near genes that have been subject to selection, including LCT and the MHC region. High-confidence variants driving distinct residual associations for Northings were also strongly associated with many of the lifestyle-related traits that are genetically correlated with birth location, even after correction for population structure. These signals could, therefore, represent false positive associations with lifestyle-related traits that are driven by confounding with geography. At signals for which the high-confidence variant was also associated with the lifestyle-related trait in the direction predicted by the genetic correlation, such as for BMI and FVC at the LCT locus, additional adjustment for axes of genetic variation and birth location as covariates reduced the strength of the association. In contrast, when the association with the lifestyle-related trait was in the opposite direction to that predicted by the genetic correlation, for example for FVC in the MHC region, adjustment for axes of genetic variation and birth location as covariates had no impact on the signal. Thus, while adjustment for axes of genetic variation and birth location, in addition to a random effect for the GRM, did not substantially reduce the inflation in association with lifestyle-related traits genome-wide, we did observe locus-specific differences in the impact of this correction, which reflect varying levels of confounding with geography.

In conclusion, our findings highlight the need for caution in the interpretation of GWAS of lifestyle-related health outcomes in UK Biobank, particularly in loci demonstrating strong residual association with birth location, even after adjustment for population structure. To minimize the impact of population structure on these traits at loci that are most strongly confounded with geography, we recommend adjusting for axes of genetic variation and birth location, in addition to a random effect for the GRM in a regression model. Where substantial residual inflation in the genome-wide association remains, for example an LD-score intercept of the order of 1.1 or more, we suggest careful consideration of potential environmental risk factors for the trait that could have more complex confounding with geography than can be accommodated by simple linear relationships with birth location (or current residence). UK Biobank has collected extensive questionnaire data on diet, smoking, alcohol consumption and exercise, and these potential confounders can be included directly as covariates in a regression model, without any assumptions about their correlation with geography. Further studies are warranted in other large-scale biobanks, particularly in less homogeneous populations where the impact of geographical confounding of allele frequencies on complex trait GWAS may be even more pronounced.

Materials and Methods

Selection of participants from UK Biobank

We utilized the subset of ‘white British’ individuals identified centrally by the UK Biobank Analysis Team (15), based on self-reported ethnicity from the assessment center questionnaire and axes of genetic variation from PCA. We then utilized the relatedness report generated by the UK Biobank Analysis Team (15) to retain the maximal set of unrelated participants, which corresponded to a maximum kinship coefficient of 0.0884.
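As an illustration only, selecting unrelated participants from a list of pairwise kinship estimates can be sketched with a greedy heuristic in Python. The function name `greedy_unrelated`, the input layout (pairs of identifiers with a kinship coefficient) and the reading of 0.0884 as the largest kinship permitted between retained individuals are assumptions made for this sketch. It does not describe the procedure used by the UK Biobank Analysis Team, and a greedy pass is not guaranteed to retain the maximal set.

```python
from collections import defaultdict


def greedy_unrelated(pairs, threshold=0.0884):
    """Return the set of individuals to drop so that no retained pair
    has kinship above `threshold` (greedy heuristic, hypothetical helper).

    pairs: iterable of (id_a, id_b, kinship) tuples.
    """
    # Build an adjacency list of "related" individuals.
    relatives = defaultdict(set)
    for a, b, kinship in pairs:
        if kinship > threshold:
            relatives[a].add(b)
            relatives[b].add(a)

    removed = set()
    while True:
        # Pick the individual with the most remaining relatives.
        worst = max(relatives, key=lambda i: len(relatives[i]), default=None)
        if worst is None or not relatives[worst]:
            break  # no related pairs remain
        for other in relatives.pop(worst):
            relatives[other].discard(worst)
        removed.add(worst)
    return removed
```

Removing the most connected individual first tends to keep more participants than dropping one member of each pair at random, but an exact maximum independent set would require a more expensive search.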

We interrogated demographic data of reported birth location, for which UK postcodes had been converted to Easting and Northing Cartesian coordinates, rounded to the nearest 500 m, relative to an origin in the South West of the British Isles (Supplementary Material, Fig. S2). We excluded individuals with missing birth location and those from the pilot study at the Stockport recruitment center for which the Cartesian coordinates were incorrect. For some sensitivity analyses, we also considered Easting and Northing Cartesian coordinates derived from current residence postcode.

Genome-wide association analyses with Cartesian coordinates of birth location in UK Biobank

The UK Biobank Central Analysis Team performed initial quality control of variants, and imputation up to reference panels from the 1000 Genomes Project (16), UK10K Project (17) and Haplotype Reference Consortium (18). We considered the subset of variants that were imputed to the Haplotype Reference Consortium, excluding those with MAF < 0.5% and/or imputation quality info score < 0.5. For each variant passing quality control, we tested for association with Northings and Eastings, separately, in a linear regression model, using the genotype dosage from imputation and including only genotyping array (UK Biobank or UK BiLEVE) as a covariate, as implemented in SNPTESTv2.5.2 (19). We used two approaches to account for population structure. First, we included ten centrally derived axes of genetic variation from PCA, in addition to genotyping array, as covariates as implemented in SNPTESTv2.5.2 (19). We also performed sensitivity analyses including twenty centrally derived axes of genetic variation from PCA. Second, we included a random effect for the GRM, in addition to a fixed effect for genotyping array, as implemented in BOLT-LMMv2.3 (13). We followed recommendations from the BOLT-LMM UK Biobank analysis pipeline: https://data.broadinstitute.org/alkesgroup/BOLT-LMM/#x1-510009. The GRM was constructed from directly genotyped variants that passed initial quality control from the UK Biobank Central Analysis Team. BOLT-LMM performs ‘leave-one-chromosome-out’ analysis: variants from the chromosome being tested for association are excluded from the GRM to avoid proximal contamination (13).
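The baseline single-variant test, with genotyping array as the only covariate, can be sketched in plain Python under simplifying assumptions. This is not the SNPTEST implementation: the function name `array_adjusted_association` is hypothetical, the array covariate is treated as a categorical label, and the P-value uses a normal approximation. Centring both dosage and phenotype within each array is equivalent to including an intercept and array indicator in ordinary least squares (the Frisch-Waugh-Lovell result).

```python
from math import sqrt
from statistics import NormalDist, fmean


def array_adjusted_association(dosage, phenotype, array):
    """OLS of phenotype on genotype dosage, adjusted for genotyping array.

    Returns (beta, standard_error, two_sided_p). Hypothetical helper.
    """
    levels = set(array)

    def centre(values):
        # Subtract the array-specific mean from every observation.
        means = {a: fmean(v for v, b in zip(values, array) if b == a)
                 for a in levels}
        return [v - means[a] for v, a in zip(values, array)]

    g = centre(dosage)
    y = centre(phenotype)
    sgg = sum(x * x for x in g)
    beta = sum(x * z for x, z in zip(g, y)) / sgg
    rss = sum((z - beta * x) ** 2 for x, z in zip(g, y))
    # Parameters fitted: one mean per array level plus the dosage effect.
    df = len(phenotype) - len(levels) - 1
    se = sqrt(rss / df / sgg)
    # Normal approximation to the test statistic (reasonable for large n).
    p = 2.0 * (1.0 - NormalDist().cdf(abs(beta / se)))
    return beta, se, p
```

Adding axes of genetic variation as further covariates would require a general least-squares solver in place of the within-group centring used here.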

For each analysis, to assess inflation in association signals due to residual population structure that was not accounted for in the analysis, we calculated the intercept from LD-score regression (20), using a subset of approximately one million variants for which European ancestry LD-scores were available. For some sensitivity analyses, we separated directly genotyped and imputed variants and calculated the LD-score intercept for each set.
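The idea behind the LD-score intercept can be illustrated with a deliberately simplified sketch: an unweighted least-squares regression of association chi-squared statistics on LD scores, where the intercept estimates inflation not explained by polygenicity. The published method (20) additionally uses regression weights and a block jackknife for standard errors, neither of which is reproduced here, and the function name `ld_score_intercept` is hypothetical.

```python
def ld_score_intercept(chi2, ld_scores):
    """Unweighted regression of chi-squared statistics on LD scores.

    Returns (intercept, slope). Simplified illustration only: no
    regression weights and no jackknife standard errors.
    """
    n = len(chi2)
    mean_l = sum(ld_scores) / n
    mean_c = sum(chi2) / n
    sxx = sum((l - mean_l) ** 2 for l in ld_scores)
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
    slope = sxy / sxx
    intercept = mean_c - slope * mean_l
    return intercept, slope
```

An intercept close to one indicates little residual confounding, whereas values well above one suggest inflation from population structure rather than from polygenic signal.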

Genetic correlation of birth location with complex human traits in the UK Biobank

We used LD-score regression (21) to assess the genome-wide genetic correlation between birth location and selected traits available in the UK Biobank. We utilized published association summary statistics available from LD-Hub (23), obtained from analysis of 337 199 unrelated white British participants passing central quality control in a generalized linear regression model with adjustment for sex and the first ten centrally derived axes of genetic variation from PCA as covariates, as implemented in Hail. Phenotypes were derived and harmonized with PHESANT (30), and association analyses were restricted to variants with MAF > 0.1%, exact Hardy–Weinberg equilibrium (HWE) P > 10⁻¹⁰ and imputation quality info score > 0.8. Full details of the quality control, phenotype derivation and association analyses can be found at: https://github.com/Nealelab/UK_Biobank_GWAS/tree/master/imputed-v2-gwas#sample-and-variant-qc.

Of the 597 traits reported from UK Biobank in LD-Hub, we excluded those that were not directly related to health outcomes, lifestyle and/or anthropometric measures (such as current employment, diseases of family members, education, and medication). For each of the remaining 268 traits, we calculated the genetic correlation with Northings and Eastings using association summary statistics after adjusting for population structure by including a random effect for the GRM as implemented in BOLT-LMMv2.3 (13), as described above. The LD-score regression analysis was restricted to a subset of approximately one million variants for which European ancestry LD-scores were available and which overlapped with those reported for birth location and the complex trait. We extracted the genetic correlation, corresponding standard error and P-value. We defined significant genetic correlation by P < 0.00019, which corresponded to a Bonferroni correction for 268 traits.
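The significance threshold can be checked with a short calculation. The sketch below assumes a family-wise error rate of 0.05, which is not stated explicitly in the text but reproduces the reported value of 0.00019 after rounding. The helper names are hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction.

    alpha=0.05 is an assumed family-wise error rate.
    """
    return alpha / n_tests


def significant_traits(p_values, threshold):
    """Return trait names whose genetic-correlation P-value passes the threshold.

    p_values: mapping of trait name to P-value.
    """
    return [trait for trait, p in p_values.items() if p < threshold]


threshold = bonferroni_threshold(268)  # 0.05 / 268, about 1.9e-4
```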

Genome-wide association analyses with BMI, FVC and asthma in UK Biobank

We performed inverse rank normalization of BMI and FVC (best measure). For each variant passing quality control, we tested for association with each trait (after transformation), separately, in a linear regression model, using the genotype dosage from imputation, and including genotyping array (UK Biobank or UK BiLEVE) as a covariate and a random effect for the GRM as implemented in BOLT-LMMv2.3 (13). We repeated each of these analyses by including (i) the first ten centrally derived axes of genetic variation from PCA as additional covariates and (ii) the first ten centrally derived axes of genetic variation from PCA, Northings and Eastings as additional covariates in the linear regression model. We also repeated our analyses, adjusting for Northings and Eastings derived from current location postcode, instead of birth location postcode, in addition to the first ten centrally derived axes of genetic variation from PCA. For each trait, for each analysis, we calculated the intercept from LD-score regression (20), using a subset of approximately one million variants for which European ancestry LD-scores were available.
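One common form of inverse rank normalization is sketched below using only the Python standard library. The rank offset of 0.5 and the use of average ranks for ties are assumptions, since the exact variant of the transformation is not specified in the text, and the function name `inverse_rank_normalize` is hypothetical.

```python
from statistics import NormalDist


def inverse_rank_normalize(values, offset=0.5):
    """Map values to standard normal quantiles of their ranks.

    Tied values receive their average rank. The offset (0.5 here) is an
    assumption; other choices such as 3/8 are also in common use.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - offset) / (n + 1 - 2 * offset)) for r in ranks]
```

The transformed trait has an approximately standard normal distribution regardless of the skew of the raw measurements, which limits the influence of extreme values on the association test.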

Dissection and fine-mapping of association signals with birth location in UK Biobank

We considered each locus attaining genome-wide significant evidence of association (P < 5 × 10⁻⁸) with Northings. Within each locus, we utilized the “--cojo-slct” option in GCTA (26) to identify index variants representing distinct association signals attaining locus-wide significance (P < 10⁻⁵), based on (i) association summary statistics for Northings after adjusting for population structure by including a random effect for the GRM as implemented in BOLT-LMMv2.3 (13), as described above and (ii) 5000 randomly selected white British participants included in our association analyses as a reference for LD in the UK population. For each locus with more than one index variant, we next dissected each distinct association signal. For each index variant, we obtained the corresponding conditional association signal by utilizing the “--cojo-cond” option in GCTA (26) by adjusting for all other index variants at the locus.

Within each locus, for each distinct signal, we first approximated the Bayes factor (31) in favor of association with Northings of each variant on the basis of summary statistics after adjusting for population structure by including a random effect for the GRM as implemented in BOLT-LMMv2.3 (13), as described above. We utilized summary statistics from unconditional analysis for loci with a single signal, and GCTA conditional analysis for loci with multiple distinct signals. Specifically, the Bayes factor for the |$j$|th variant at the |$i$|th distinct association signal is approximated by
|${\Lambda}_{ij}=\exp \left[\frac{b_{ij}^2}{2{v}_{ij}}\right],$|
where |${b}_{ij}$| and |${v}_{ij}$| are the allelic effect on Northings and the corresponding variance, respectively. We then calculated the posterior probability that the |$j$|th variant is driving the |$i$|th distinct association, given by
|${\pi}_{ij}=\frac{\Lambda_{ij}}{\sum_k{\Lambda}_{ik}},$|
where the summation is over all variants across the locus. We defined ‘high-confidence’ variants as having posterior probability of at least 5% of driving distinct association signals for birth location.
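The fine-mapping calculation can be sketched in Python as follows. The sketch assumes that the approximate Bayes factor for a variant takes the form exp(b²/2v), with b the allelic effect and v its variance, and that the posterior probability is the Bayes factor divided by the sum of Bayes factors across the locus. The function names are hypothetical, and the calculation is done on the log scale to avoid overflow for strongly associated variants.

```python
from math import exp


def posterior_probabilities(effects, variances):
    """Posterior probability that each variant drives the signal.

    Assumes log Bayes factor = b^2 / (2 v) for each variant and a flat
    prior across variants at the locus.
    """
    log_bf = [b * b / (2.0 * v) for b, v in zip(effects, variances)]
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(log_bf)
    weights = [exp(x - largest) for x in log_bf]
    total = sum(weights)
    return [w / total for w in weights]


def high_confidence(variant_ids, effects, variances, cutoff=0.05):
    """Return (variant, posterior) pairs with posterior of at least `cutoff`."""
    posteriors = posterior_probabilities(effects, variances)
    return [(vid, p) for vid, p in zip(variant_ids, posteriors) if p >= cutoff]
```

Any constant term common to all variants at a locus cancels in the ratio, so the posterior probabilities depend only on the relative strength of association.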

Association of high-confidence variants with complex human traits in UK Biobank

We extracted association summary statistics for each high-confidence variant for each trait attaining significant genetic correlation with Northings in UK Biobank using PhenoScanner (27,28). Association summary statistics were obtained from analysis of 337 199 unrelated white British participants passing central quality control in a generalized linear regression model with adjustment for sex and the first ten centrally derived axes of genetic variation from PCA as covariates, as implemented in Hail. Phenotypes were derived and harmonized with PHESANT (30), and association analyses were restricted to variants with MAF > 0.1%, exact HWE P > 10⁻¹⁰, and imputation quality info score > 0.8.

Acknowledgements

This study has been conducted using the UK Biobank resource (project number 15390).

Conflict of Interest statement. As of January 2020, A.M. is an employee of Genentech and a holder of Roche stock.

References

1. Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447, 661–678.

2. O'Dushlaine, C.T., Morris, D., Moskvina, V., Kirov, G., International Schizophrenia Consortium, Gill, M., Corvin, A., Wilson, J.F. and Cavalleri, G.L. (2010) Population structure and genome-wide patterns of variation in Ireland and Britain. Eur. J. Hum. Genet., 18, 1248–1254.

3. Leslie, S., Winney, B., Hellenthal, G., Davison, D., Boumertit, A., Day, T., Hutnik, K., Royrvik, E.C., Cunliffe, B., Wellcome Trust Case Control Consortium 2 et al. (2015) The fine-scale genetic structure of the British population. Nature, 519, 309–314.

4. Heath, S.C., Gut, I.G., Brennan, P., McKay, J.D., Bencko, V., Fabianova, E., Foretova, L., Georges, M., Janout, V., Kabesch, M. et al. (2008) Investigation of the fine structure of European populations with applications to disease association studies. Eur. J. Hum. Genet., 16, 1413–1429.

5. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A. and Reich, D. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet., 38, 904–909.

6. Kang, H.M., Sul, J.H., Service, S.K., Zaitlen, N.A., Kong, S.Y., Freimer, N.B., Sabatti, C. and Eskin, E. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat. Genet., 42, 348–354.

7. Zhang, Z., Ersoz, E., Lai, C.Q., Todhunter, R.J., Tiwari, H.K., Gore, M.A., Bradbury, P.J., Yu, J., Arnett, D.K., Ordovas, J.M. et al. (2010) Mixed linear model approach adapted for genome-wide association studies. Nat. Genet., 42, 355–360.

8. Price, A.L., Zaitlen, N.A., Reich, D. and Patterson, N. (2010) New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet., 11, 459–463.

9. Lippert, C., Listgarten, J., Liu, Y., Kadie, C.M., Davidson, R.I. and Heckerman, D. (2011) FaST linear mixed models for genome-wide association studies. Nat. Methods, 8, 833–835.

10. Listgarten, J., Lippert, C., Kadie, C.M., Davidson, R.I., Eskin, E. and Heckerman, D. (2012) Improved linear mixed models for genome-wide association studies. Nat. Methods, 9, 525–526.

11. Zhou, X. and Stephens, M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nat. Genet., 44, 821–824.

12. Svishcheva, G.R., Axenovich, T.I., Belonogova, N.M., van Duijn, C.M. and Aulchenko, Y.S. (2012) Rapid variance components-based method for whole-genome association analysis. Nat. Genet., 44, 1166–1170.

13. Loh, P.R., Tucker, G., Bulik-Sullivan, B.K., Vilhjálmsson, B.J., Finucane, H.K., Salem, R.M., Chasman, D.I., Ridker, P.M., Neale, B.M., Berger, B. et al. (2015) Efficient Bayesian mixed model analysis increases association power in large cohorts. Nat. Genet., 47, 284–290.

14. Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M. et al. (2015) UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med., 12, e1001779.

15. Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L.T., Sharp, K., Motyer, A., Vukcevic, D., Delaneau, O., O'Connell, J. et al. (2018) The UK biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209.

16. 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature, 526, 68–74.

17. UK10K Consortium (2015) The UK10K project identifies rare variants in health and disease. Nature, 526, 82–90.

18. McCarthy, S., Das, S., Kretzschmar, W., Delaneau, O., Wood, A.R., Teumer, A., Kang, H.M., Fuchsberger, C., Danecek, P., Sharp, K. et al. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet., 48, 1279–1283.

19. Marchini, J. and Howie, B. (2010) Genotype imputation for genome-wide association studies. Nat. Rev. Genet., 11, 499–511.

20. Bulik-Sullivan, B.K., Loh, P.R., Finucane, H.K., Ripke, S., Yang, J., Schizophrenia Working Group of the Psychiatric Genomics Consortium, Patterson, N., Daly, M.J., Price, A.L. and Neale, B.M. (2015) LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet., 47, 291–295.

21. Bulik-Sullivan, B.K., Finucane, H.K., Antilla, V., Gusev, A., Day, F.R., Loh, P.R., ReproGen Consortium, Psychiatric Genetics Consortium, Genetic Consortium for Anorexia Nervosa of the Wellcome Trust Case Control Consortium 3 et al. (2015) An atlas of genetic correlation across human diseases and traits. Nat. Genet., 47, 1236–1241.

22. Sabeti, P.C., Schaffner, S.F., Fry, B., Lohmueller, J., Varilly, P., Shamovsky, O., Palma, A., Mikkelsen, T.S., Altshuler, D. and Lander, E.S. (2006) Positive natural selection in the human lineage. Science, 312, 1614–1620.

23. Zheng, J., Erzurumluoglu, A.M., Elsworth, B.L., Kemp, J.P., Howe, L., Haycock, P.C., Hemani, G., Tansey, K., Laurin, C., Early Genetics and Lifecourse Epidemiology (EAGLE) Eczema Consortium et al. (2017) LD hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics, 33, 272–279.

24. Haworth, S., Mitchell, R., Corbin, L., Wade, K.H., Dudding, T., Budu-Aggrey, A., Carslake, D., Hemani, G., Paternoster, L., Smith, G.D. et al. (2019) Apparent latent structure within the UK biobank sample has implications for epidemiological analysis. Nat. Commun., 10, 333.

25. Zhang, Y. and Pan, W. (2015) Principal component regression and linear mixed model in association analysis of structured samples: competitors or complements? Genet. Epidemiol., 39, 149–155.

26. Yang, J., Ferreira, T., Morris, A.P., Medland, S.E., Genetic Investigation of ANthropometric Traits (GIANT) Consortium, DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium, Madden, P.A., Heath, A.C., Martin, N.G., Montgomery, G.W. et al. (2012) Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat. Genet., 44, 369–375.

27. Staley, J.R., Blackshaw, J., Kamat, M.A., Ellis, S., Surendran, P., Sun, B.B., Paul, D.S., Freitag, D., Burgess, S., Danesh, J. et al. (2016) PhenoScanner: a database of human genotype-phenotype associations. Bioinformatics, 32, 3207–3209.

28. Kamat, M.A., Blackshaw, J.A., Young, R., Surendran, P., Burgess, S., Danesh, J., Butterworth, A.S. and Staley, J.R. (2019) PhenoScanner V2: an expanded tool for searching human genotype-phenotype associations. Bioinformatics, 35, 4851–4853.

29. Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R., Auton, A., Indap, A., King, K.S., Bergmann, S., Nelson, M.R. et al. (2008) Genes mirror geography within Europe. Nature, 456, 98–101.

30. Millard, L.A.C., Davies, N.M., Gaunt, T.R., Davey Smith, G. and Tilling, K. (2018) PHESANT: a tool for performing automated phenome scans in UK biobank. Int. J. Epidemiol., 47, 29–35.

31. Schwarz, G. (1978) Estimating the dimension of a model. Ann. Stat., 6, 461–464.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
