Comparison of Sociodemographic and Health-Related Characteristics of UK Biobank Participants With Those of the General Population

Abstract The UK Biobank cohort is a population-based cohort of 500,000 participants recruited in the United Kingdom (UK) between 2006 and 2010. Approximately 9.2 million individuals aged 40–69 years who lived within 25 miles (40 km) of one of 22 assessment centers in England, Wales, and Scotland were invited to enter the cohort, and 5.5% participated in the baseline assessment. The representativeness of the UK Biobank cohort was investigated by comparing demographic characteristics between nonresponders and responders. Sociodemographic, physical, lifestyle, and health-related characteristics of the cohort were compared with nationally representative data sources. UK Biobank participants were more likely to be older, to be female, and to live in less socioeconomically deprived areas than nonparticipants. Compared with the general population, participants were less likely to be obese, to smoke, and to drink alcohol on a daily basis and had fewer self-reported health conditions. At age 70–74 years, rates of all-cause mortality and total cancer incidence were 46.2% and 11.8% lower, respectively, in men and 55.5% and 18.1% lower, respectively, in women than in the general population of the same age. UK Biobank is not representative of the sampling population; there is evidence of a “healthy volunteer” selection bias. Nonetheless, valid assessment of exposure-disease relationships may be widely generalizable and does not require participants to be representative of the population at large.

The UK Biobank Study is a large prospective cohort study, established primarily to investigate the genetic and lifestyle determinants of a wide range of diseases of middle and later life (1). This open-access resource involves 500,000 United Kingdom (UK) men and women who were aged 40-69 years when recruited throughout England, Wales, and Scotland between 2006 and 2010. Extensive questionnaire data, physical measurements, and biological samples were collected at recruitment, and there is ongoing enhanced data collection in large subsets of the cohort, including a repeat baseline assessment, genotyping, biochemical assays, Web-based questionnaires, physical activity monitoring, and multimodal imaging. All participants are followed up for health conditions through linkage to national electronic health-related data sets.
Our aim in the current study was to examine and quantify whether the UK Biobank cohort differed from the sampling frame with regard to a range of characteristics due to the "healthy volunteer effect" (2), whereby people who volunteer for research studies tend to be, on average, more health-conscious than nonparticipants (3). To investigate this, we compared the distributions of a range of sociodemographic, physical, lifestyle, and health-related characteristics between UK Biobank participants and 1) persons invited to join UK Biobank and 2) respondents to nationally representative surveys.

METHODS
UK Biobank investigators sent postal invitations to 9,238,453 individuals registered with the UK's National Health Service who were aged 40-69 years and lived within approximately 25 miles (40 km) of one of 22 assessment centers located throughout England, Wales, and Scotland. The National Information Governance Board for Health and Social Care and the North West Multicentre Research Ethics Committee provided approval for UK Biobank to obtain the contact details of people within the eligible age range from local National Health Service Primary Care Trusts. UK Biobank also received approval to retain limited information on nonresponders. Overall, 503,317 participants consented to join the study cohort and visited an assessment center between 2006 and 2010, resulting in a participation rate of 5.45% (see Web Figure 1, available at https://academic.oup.com/aje, for a flow chart demonstrating responses to invitations).
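The participation rate quoted above is a simple proportion of participants among those invited, and the same calculation applies to any stratum (sex, age group, deprivation category) once its counts are known. A minimal Python sketch follows; the function name is illustrative and is not part of any UK Biobank tooling.

```python
def participation_rate(participants: int, invited: int) -> float:
    """Return participants as a percentage of those invited."""
    if invited <= 0:
        raise ValueError("invited must be positive")
    return 100.0 * participants / invited


# Totals reported in the text: 503,317 participants of 9,238,453 invited.
overall = participation_rate(503_317, 9_238_453)
print(f"Overall participation rate: {overall:.2f}%")
```

Applying the function to stratum-specific counts would give the kind of sex-, age-, and deprivation-specific rates reported in the results.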
Anonymized data on sex, month and year of birth, Townsend deprivation index (an indicator of socioeconomic status), and geographic location are stored in the UK Biobank resource and were available for 8,761,869 of the 9,238,453 (94.8%) individuals sent an invitation letter, allowing us to compare the distributions of these characteristics between nonparticipating invitees and participants. The distributions of a range of sociodemographic, physical, lifestyle, and health-related characteristics of the UK Biobank cohort were also compared with publicly available summary data from nationally representative population-based surveys and the UK Census. We selected summary survey data that matched the UK Biobank cohort as closely as possible with regard to population demographic factors (i.e., both sexes and ages 40-69 years) and the period of data collection (2006-2010). Where certain characteristics from the national survey summary data were available only in prespecified aggregated age and sex subgroups, UK Biobank data were stratified into similar groups for comparative purposes. Formal statistical tests of the difference in characteristics between UK Biobank and national data were not performed, because the comparison populations lacked the variance measures (such as standard deviations) required to test for differences between means.
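The stratification step described above (grouping cohort data into the same age and sex bands as the published survey summaries) can be sketched as follows. This is an illustration under stated assumptions only: the record layout, the function names, and the two age bands are hypothetical and chosen to mirror the 45-54 and 55-64 year groups used in the examples in this article.

```python
from collections import defaultdict

# Assumed age bands, mirroring the groups quoted in the text's examples.
AGE_BANDS = [(45, 54), (55, 64)]


def age_band(age):
    """Return the label of the band containing `age`, or None if outside all bands."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return None


def prevalence_by_stratum(records):
    """Compute percentage prevalence per (sex, age band).

    `records` is an iterable of (sex, age, has_condition) tuples (hypothetical layout).
    Records whose age falls outside every band are ignored.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [cases, total]
    for sex, age, has_condition in records:
        band = age_band(age)
        if band is None:
            continue
        cell = counts[(sex, band)]
        cell[0] += int(has_condition)
        cell[1] += 1
    return {stratum: 100.0 * cases / total
            for stratum, (cases, total) in counts.items()}
```

The resulting stratum-level percentages can then be set side by side with the published survey figures for the same strata.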
The UK Census collects individual and household-level demographic data every 10 years for the whole UK population. Data on ethnicity were obtained from the 2001 and 2011 UK Census for England, Wales, and Scotland, as these were the census years falling immediately before and immediately after the recruitment period (4,5). Data on property ownership status were obtained from the 2001 UK Census for England and Wales only, since 2011 UK Census data on property ownership were not available for the appropriate age groups. Data on anthropometric measures, smoking status, alcohol consumption, and prevalences of self-reported health conditions were obtained from the Health Survey for England (HSE) for the years 2006, 2008, 2009, and 2010 (6-9). The HSE is an annual cross-sectional survey of a small (approximately 5,000-15,000 persons), representative sample of the population of England, drawn through a 2-stage random probability sampling process, with different data items being collected from a different sample each year (10,11). Since 2003, the HSE has incorporated weighting to account for nonresponse bias (12). This includes different weights for nonresponding households, nonresponding individuals in responding households, and nonresponse at different stages of data collection. For a detailed description of the data collection methods used in UK Biobank and national surveys, see Web Table 1.
Age- and sex-specific data on all-cause mortality and cancer incidence rates for England were obtained from the Office for National Statistics for 2012, as this date represented the midpoint of the follow-up period for UK Biobank participants (13,14). For all-cause mortality, follow-up time (person-years) in the UK Biobank cohort was calculated as the period ranging from age at recruitment to age at death or the date of complete follow-up (November 30, 2015), whichever came first; for cancer incidence rates, follow-up time was defined as the period ranging from age at recruitment to age at first cancer diagnosis, death, or the date of complete follow-up (September 30, 2014), whichever came first (among persons with no cancer at recruitment, based on cancer registry data). Cancer incidence rates were calculated for total cancer (excluding nonmelanoma skin cancer), defined using International Classification of Diseases, Tenth Revision (ICD-10), codes C00-C97 (excluding code C44), and for common cancer types: prostate (ICD-10 code C61), breast (ICD-10 code C50), colorectal (ICD-10 codes C18-C20), lung (ICD-10 codes C33-C34), endometrium (ICD-10 code C54), and kidney (ICD-10 code C64).
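The follow-up definition for all-cause mortality can be illustrated with a short Python sketch. It is not the authors' analysis code: it works with calendar dates rather than ages (which gives the same interval length), converts days to years using 365.25 days per year, and uses a hypothetical two-person cohort. Only the censoring date of November 30, 2015, is taken from the text.

```python
from datetime import date
from typing import Optional

# End of mortality follow-up as stated in the text.
MORTALITY_CENSOR = date(2015, 11, 30)


def person_years(recruited: date,
                 died: Optional[date] = None,
                 censor: date = MORTALITY_CENSOR) -> float:
    """Follow-up from recruitment to death or the censoring date, whichever is first."""
    end = censor if died is None else min(died, censor)
    # Assumption: 365.25 days per year; never return negative follow-up.
    return max((end - recruited).days, 0) / 365.25


def rate_per_100k(events: int, total_person_years: float) -> float:
    """Crude event rate per 100,000 person-years."""
    if total_person_years <= 0:
        raise ValueError("total_person_years must be positive")
    return 100_000.0 * events / total_person_years


# Hypothetical cohort of (recruitment date, death date or None).
cohort = [
    (date(2008, 11, 30), None),
    (date(2009, 6, 1), date(2013, 6, 1)),
]
total = sum(person_years(r, d) for r, d in cohort)
deaths = sum(d is not None and d <= MORTALITY_CENSOR for _, d in cohort)
print(rate_per_100k(deaths, total))
```

For cancer incidence the same function would be called with the earlier censoring date (September 30, 2014) and with the date of first cancer diagnosis treated as an additional end point.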
The UK Biobank Study received approval from the National Information Governance Board for Health and Social Care and the National Health Service North West Multicentre Research Ethics Committee.

RESULTS
Characteristics of UK Biobank participants versus nonparticipating invitees
Of the 9,238,453 men and women invited to join UK Biobank, 503,317 (5.45%) consented and were recruited between 2006 and 2010. The participation rate was higher in women than in men (6.4% vs. 5.1%) (Figure 1A), in older age groups (9% in those aged ≥60 years vs. 3% in those aged 40-44 years) (Figure 1B), and in less socioeconomically deprived areas (8.3% among persons from the least deprived areas vs. 3.1% among persons from the most deprived areas) (Figure 1C). Participation rates also differed by region, being highest in South West England (9.6%) and East Scotland (8.2%) and lowest in West Scotland (4.3%), London, the West Midlands, and North West England (all 4.7%) (Figure 1D; also see Web Table 2 for further details).

Characteristics of UK Biobank participants compared with national survey data
Sociodemographic factors. In the UK Biobank cohort, 94.6% of participants were of white ethnicity, which was similar to the national population of the same age range in the 2001 UK Census (94.5%) but somewhat higher than in [...] live in rental accommodations than the general population of the same age range (Table 2).
Physical characteristics. UK Biobank participants were, on average, taller and leaner and had a smaller waist circumference than the general population, based on the HSE 2008 (Table 3). For example, mean body mass index (defined as weight (kg)/height (m)²) in UK Biobank men and women aged 55-64 years was 27.9 and 27.3, respectively, as compared with 28.5 and 28.0 in the general population, based on data from the HSE 2008. UK Biobank men and women were also less likely to be obese (defined as body mass index ≥30) across all age groups examined in comparison with the general population. For example, for men aged 45-54 years, the prevalence of obesity was 25.6% in UK Biobank and 31.5% in the general population, with corresponding values of 23.0% and 32.2%, respectively, for women (Web Table 3).
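The two definitions used above (body mass index as weight in kilograms divided by the square of height in meters, and obesity as body mass index of 30 or more) translate directly into code. The following Python sketch is illustrative; the function names and the example measurements are hypothetical.

```python
OBESITY_THRESHOLD = 30.0  # body mass index cutoff for obesity, as defined in the text


def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def is_obese(bmi: float) -> bool:
    """True when body mass index is at or above the obesity cutoff."""
    return bmi >= OBESITY_THRESHOLD


def obesity_prevalence(measurements) -> float:
    """Percentage obese among (weight_kg, height_m) pairs."""
    flags = [is_obese(body_mass_index(w, h)) for w, h in measurements]
    return 100.0 * sum(flags) / len(flags)


# Hypothetical measurements, for illustration only.
print(obesity_prevalence([(81.0, 1.80), (95.0, 1.75), (70.0, 1.65), (100.0, 1.70)]))
```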
Lifestyle characteristics. UK Biobank men and women were less likely to be current smokers than the general population across all age groups, based on data from the HSE 2008 (Figure 2). For example, for men aged 45-54 years, the prevalence of current smoking was 15% in UK Biobank and 22% in the general population; the corresponding values for women were 11% and 20%, respectively. However, younger smokers (aged 45-54 years) in UK Biobank were more likely to smoke heavily (≥20 cigarettes/day) than those in the general population (46% and 41%, respectively, for men; 32% and 28%, respectively, for women). This difference persisted for older women aged 55-64 years (31% and 23% in UK Biobank and the general population, respectively) but not for older men (47% and 49%, respectively; Web Figure 2). UK Biobank participants were less likely than the general population included in the HSE 2008 to be never drinkers, but they were also less likely to drink alcohol every day (Table 4).
Self-reported health conditions. UK Biobank participants had a lower prevalence of self-reported health conditions, including cardiovascular disease, stroke, hypertension, diabetes, chronic kidney disease, and respiratory disease, than the general population, as obtained from various HSEs performed in 2006, 2009, and 2010 (Table 5). For example, among men aged 45-54 years, the prevalence of self-reported cardiovascular disease was 4.6% in UK Biobank participants and 10.9% in the general population, and among women aged 45-54 years the prevalences were 2.4% and 10.3%, respectively.
Figure caption fragment (Townsend deprivation categories): [...] Table 1. Participants were assigned a Townsend deprivation score corresponding to the output area of their residential postcode (most deprived: ≥2.00; average: −2.00 to 1.99; least deprived: <−2.00). UK, United Kingdom.
Table footnote fragment (census and property ownership data): [...] (4) for further information about census data. b Excludes 4,313 UK Biobank participants aged 50-64 years who were missing data on property ownership status or who responded "none of the above" or "prefer not to answer." c Category not included in the UK Biobank questionnaire.
All-cause mortality and cancer incidence rates. UK Biobank participants were followed up for mean durations of 6.77 (standard deviation, 1.01) years and 5.53 (standard deviation, 1.10) years for all-cause mortality and incident cancer, respectively. Compared with national death rates among persons aged 70-74 years, all-cause mortality in UK Biobank participants was 46.2% lower in men and 55.5% lower in women (Figure 3A and 3B; also see Web Table 4 for further details of age-specific mortality rates). The total cancer incidence rate was also lower than in the general population, being 11.8% and 18.1% lower at ages 70-74 years in men and women, respectively (Figure 4A and 4B; also see Web Table 5 for further details of age-specific cancer incidence rates). A similar pattern was observed for cancers of the colorectum, kidney, and endometrium (Web Figure 3). Lung cancer incidence rates in UK Biobank were markedly lower for both men and women, while rates of female breast cancer were similar to the national average, with the exception of women aged 45-49 years, in whom the rate was higher in the UK Biobank cohort. In contrast, prostate cancer incidence was higher in UK Biobank compared with national rates across all age groups examined.
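The "percent lower" comparisons of mortality and cancer incidence read most naturally as relative differences between the cohort rate and the national rate for the same age and sex group. The text does not spell out the formula, so the Python sketch below rests on that assumption, and the rates passed to it are hypothetical values chosen only to show the arithmetic.

```python
def percent_lower(cohort_rate: float, reference_rate: float) -> float:
    """Relative difference, in percent, by which the cohort rate falls below the reference.

    A negative result means the cohort rate is higher than the reference rate.
    """
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return 100.0 * (reference_rate - cohort_rate) / reference_rate


# Hypothetical rates per 100,000 person-years, for illustration only.
print(round(percent_lower(cohort_rate=1_200.0, reference_rate=2_000.0), 1))
```

Under this definition, a cohort rate that is about half the national rate corresponds to a value near 50%.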

DISCUSSION
The rate of participation in the UK Biobank Study was higher among women, older age groups, and persons living in less socioeconomically deprived areas. UK Biobank participants also differed with regard to several lifestyle and health-related characteristics when compared with the general population of the same age. For example, men aged 45-54 years were less likely to be obese (25.6% in UK Biobank vs. 31.5% in the general population) and less likely to be current smokers (15% vs. 22%), with similar findings being observed for women and older age groups. Furthermore, compared with the general population, UK Biobank participants were less likely to drink alcohol on a daily basis and had fewer self-reported health conditions. Linkage of UK Biobank participants with their health records during an average of 6-7 years of follow-up also showed lower rates of all-cause mortality and total cancer incidence than in the general population of the same age. These findings are consistent with the well-established "healthy volunteer" effect, which has been demonstrated in other volunteer-based cohort studies (15-17). Other prospective studies have also found lower rates of all-cause mortality and incident cancer in comparison with national rates (18-21). The only examined health condition that had a higher incidence rate in UK Biobank than in the general population was prostate cancer, which might reflect higher rates of voluntary prostate-specific antigen testing (and subsequent prostate cancer diagnosis) among health-conscious men. In contrast, lung cancer incidence rates were markedly lower in UK Biobank across all age and sex groups, which almost certainly reflects the lower prevalence of smoking in the cohort than in the general population.
Table footnote fragment (alcohol intake): [...] (9) for further information about HSE data. b HSE estimates were weighted for nonresponse bias. c Excludes 1,013 UK Biobank participants aged 45-64 years who were missing data for alcohol intake or responded "prefer not to answer." d The HSE categories "almost every day" and "5 or 6 days a week" were defined as "daily." e The HSE categories "once every couple of months" and "once or twice in the past year" were defined as "special occasions." f The HSE category "not at all in the last 12 months/nondrinker" was defined as "never."
Table footnote fragment (self-reported health conditions): [...] 1,123, n = 1,015, n = 1,141, and n = 1,050, respectively). HSE 2009 estimates were used for hypertension (n = 274, n = 244, n = 280, and n = 253, respectively) and diabetes (n = 391, n = 345, n = 398, and n = 358, respectively). HSE 2010 estimates were used for asthma (n = 720, n = 608, n = 730, and n = 630, respectively) and COPD (n = 720, n = 608, n = 730, and n = 631, respectively). Both 2009 and 2010 estimates (n = 1,112, n = 1,128, n = 953, and n = 989, respectively) were used for chronic kidney disease. d Cardiovascular disease included angina, heart attack, stroke, heart murmur, and irregular heart rhythm. e Ischemic heart disease included heart attack or angina. f HSE estimates were available only to the nearest integer.
Because UK Biobank participants are, on average, more health-conscious than the general population, this cohort is not well suited to estimating generalizable prevalence or incidence rates of disease (although some health-related characteristics of the UK Biobank cohort, such as the prevalence of self-reported pain, have previously been shown to be similar to those of the national population (22)). In order for a cohort study to produce generalizable associations of exposures with disease, it is important that sufficiently large numbers of individuals with different levels of exposures be investigated with high internal validity (23-26). Indeed, if one were interested in investigating the association of ethnicity with subsequent disease risk, the most appropriate study design would be to recruit a large number of people from different ethnic backgrounds rather than have a representative, largely white population. Because UK Biobank is primarily designed for investigating exposure-disease associations, the lack of representativeness should not be regarded as a limitation (27,28). As with all observational studies, it is incumbent upon researchers to acknowledge, on a case-by-case basis, potential sources of bias that might affect the generalizability of exposure-disease associations, such as residual confounding, reverse causation, and self-selection bias (24,29). Although the UK Biobank Study is still in the early stages as a prospective study, initial publications have shown expected associations of cardiometabolic morbidity, self-reported health, and smoking with mortality risk (30,31). This study provides an overview of the representativeness of the UK Biobank cohort with regard to a variety of key characteristics in comparison with the general UK population using data from nationally representative surveys.
We expect that these findings will be used by researchers to inform the interpretation of results or, in some instances, to help generate weighted results (e.g., in order to estimate nationally representative disease rates). We were able to compare participation rates for key sociodemographic characteristics (such as age, sex, socioeconomic status, and geographic location) due to the availability of such data for the total sampling frame. The availability of follow-up health data enabled us to compare death and cancer incidence rates with age- and sex-specific national rates, and the large size of the cohort meant that sufficient numbers of cases had accrued to investigate common cancer types. All UK Biobank participants are flagged by national death and cancer registries, and loss to follow-up due to emigration has been minimal (0.3% of the cohort). Further follow-up is required to determine whether this "healthy volunteer effect" attenuates over time (owing to the development of chronic disease as the cohort ages), a phenomenon which has been observed in previous studies (18,20,32).
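One common way to generate weighted results of the kind mentioned above is poststratification, in which each stratum of the cohort is weighted by the ratio of its share in the target population to its share in the sample. The article does not prescribe a weighting method, so the Python sketch below is only one possible approach; the stratum labels, counts, and population shares are hypothetical.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight per stratum = population share / sample share."""
    total = sum(sample_counts.values())
    return {stratum: population_shares[stratum] / (count / total)
            for stratum, count in sample_counts.items()}


def weighted_prevalence(cases, sample_counts, weights):
    """Weighted proportion of cases across strata."""
    numerator = sum(weights[s] * cases[s] for s in sample_counts)
    denominator = sum(weights[s] * sample_counts[s] for s in sample_counts)
    return numerator / denominator


# Hypothetical example: the sample over-represents the less deprived stratum.
sample_counts = {"less deprived": 600, "more deprived": 400}
population_shares = {"less deprived": 0.5, "more deprived": 0.5}
cases = {"less deprived": 60, "more deprived": 80}

weights = poststratification_weights(sample_counts, population_shares)
crude = sum(cases.values()) / sum(sample_counts.values())
adjusted = weighted_prevalence(cases, sample_counts, weights)
print(f"crude {crude:.3f}, weighted {adjusted:.3f}")
```

Weighting of this kind corrects only for the characteristics used to define the strata; differences in unmeasured characteristics between participants and the target population would remain.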
One limitation of our study is that the national survey data (available from the UK Census and the HSE) were presented in prespecified age groups, thereby restricting the comparisons that could be performed. For the majority of characteristics, comparable national survey data were available only for England, although only 11% of participants were recruited in Wales and Scotland and the distributions of most characteristics were similar across the 3 countries. It is also possible that differences in the wording of questions, answer choices, and data collection methods might have influenced the comparability of certain characteristics between the national surveys and the UK Biobank cohort. For example, the HSE consisted primarily of a verbal interview that enabled the interviewer to probe the participant for further information, whereas data on all of the characteristics of UK Biobank participants presented here were collected via a touchscreen questionnaire, with the exception of information on self-reported health conditions, which was collected through a verbal interview with a trained nurse.
In conclusion, the UK Biobank cohort is not representative of the general population with regard to a number of sociodemographic, physical, lifestyle, and health-related characteristics. UK Biobank participants generally live in less socioeconomically deprived areas; are less likely to be obese, to smoke, and to drink alcohol on a daily basis; and have fewer self-reported health conditions. All-cause mortality is approximately half that of the UK population as a whole, and total cancer incidence rates are approximately 10%-20% lower. Although UK Biobank is not suitable for deriving generalizable disease prevalence and incidence rates, its large size and heterogeneity of exposure measures provide valid scientific inferences of associations between exposures and health conditions that are generalizable to other populations.