The landscape of health disparities in the UK Biobank

The UK Biobank (UKB), a large-scale biomedical database that includes demographic and electronic health record data for more than half a million ethnically diverse participants, is a potentially valuable resource for the study of health disparities. However, publicly accessible databases that catalog health disparities in the UKB do not exist. We developed the UKB Health Disparities Browser with the aims of (i) facilitating the exploration of the landscape of health disparities in the UK and (ii) directing attention to areas of disparities research that might have the greatest public health impact. Health disparities were characterized for UKB participant groups defined by age, country of residence, ethnic group, sex and socioeconomic deprivation. We defined disease cohorts for UKB participants by mapping participant International Classification of Diseases, Tenth Revision (ICD-10) diagnosis codes to phenotype codes (phecodes). For each of the population attributes used to define population groups, disease percent prevalence values were computed for all groups from phecode case-control cohorts, and the magnitude of the disparities was calculated as both the difference and the ratio of the range of disease prevalence values among groups to identify high- and low-prevalence disparities. We identified numerous diseases and health conditions with disparate prevalence values across population attributes, and we deployed an interactive web browser to visualize the results of our analysis: https://ukbatlas.health-disparities.org. The interactive browser includes overall and group-specific prevalence data for 1513 diseases based on a cohort of >500 000 participants from the UKB. Researchers can browse and sort by disease prevalence and prevalence differences to visualize health disparities for each of the five population attributes, and users can search for diseases of interest by disease names or codes.


Introduction
Health disparities can be defined most simply as differences in health outcomes between groups of people, where the groups can be delineated in a variety of ways. The etiology of these differences in outcomes is multifactorial, with contributions from a combination of biological (genetic), social and environmental risk factors (1). Ready access to information on health disparities can help investigators and policymakers identify areas for research and, where possible, intervention.
Biobanks, being repositories of large amounts of demographic and clinical data, are ideally suited for characterizing health disparities (2,3). The UK Biobank (UKB) is one of the largest and most mature biobanks that are available to researchers worldwide (4,5). Accordingly, the UKB offers an unprecedented opportunity to characterize the landscape of health disparities in the UK. Given the diverse, cosmopolitan nature of the population of the UK, with numerous immigrants from different Commonwealth countries, characterizing disparities using the UKB can support efforts to improve health equity for underserved minority populations.
We developed the UKB Health Disparities Browser as a means for researchers to explore the landscape of health disparities in the UK for groups defined by age, country of residence, ethnicity, sex and socioeconomic deprivation (SED). The browser includes prevalence data for 1147 diseases based on a cohort of >500 000 participants from the UKB. Users can browse and sort by disease prevalence and prevalence differences to visualize health disparities for each of these five population attributes, and users can search for diseases of interest by disease names or codes.

Study cohort
We used participant data from the UKB, a prospective cohort study set up to investigate the lifestyle, environmental and genetic determinants of a wide variety of diseases of adulthood (4). From 2006 to 2010, the study recruited >500 000 participants aged between 40 and 70 years (Supplementary Table S1). Participant data include completed questionnaires, nurse-led interviews, biological samples and deep clinical data gleaned from electronic health records.

Population attributes and comparison groups
We used the following participant data fields from UKB data: (i) age (Field 21003: age when attended assessment center) (6), (ii) assessment center (Field 54: UKB assessment center) (7), (iii) ethnic group and background (Field 21000: ethnic background) (8), (iv) International Classification of Diseases, Tenth Revision (ICD-10) codes (Field 41270: diagnoses-ICD10) (9), (v) sex (Field 31: sex) (10) and (vi) Townsend deprivation index (Field 189: Townsend deprivation index at recruitment) (11). Investigators from the UKB invited participants who lived within 25 miles of one of the 22 recruitment centers located across England, Scotland and Wales. Accordingly, we used the location of a participant's assessment center to determine their country of residence. We used the Townsend index of deprivation as a measure of SED. The Townsend index is a widely used, composite metric that incorporates (i) unemployment, (ii) non-car ownership, (iii) non-home ownership and (iv) household overcrowding in a given area (12). A higher value of the Townsend index indicates higher material deprivation and a lower value indicates relative affluence. A detailed description of these UKB data fields can be found on the UKB data showcase at https://biobank.ndph.ox.ac.uk/showcase/.
Comparison groups were defined for each of the five population attributes studied here: age, country of residence, ethnic group, sex and SED. For age, participants were partitioned into four groups based on their age at recruitment (35-44, 45-54, 55-64 and 65-74 years old). For the country of residence, three groups were created (England, Scotland and Wales; the UKB did not have recruitment centers in Northern Ireland). For ethnicity, the initial UKB assessment questionnaire asked participants to identify as belonging to one of the six ethnic groups (Asian, Black, Chinese, Mixed, White or Other), and participants' self-identified ethnic groups were used for disease prevalence comparisons. Chinese is included as a distinct ethnic group compared to Asian, which includes individuals of Bangladeshi, Indian and Pakistani origin, following the convention of the UK National Health Service (NHS) and the classification provided by the UKB. The NHS makes this distinction owing to cultural, socioeconomic and ancestry differences between the larger South Asian and smaller East Asian immigrant groups in the UK. For sex, males and females were compared. For SED, the participants were divided into five equal groups using the Townsend index of deprivation quintiles.
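The grouping rules above can be illustrated with a minimal sketch. It uses only the Python standard library rather than the Pandas pipeline used for the analysis, and the function names (`age_group`, `deprivation_quintile`) and the Townsend values are hypothetical, chosen only for illustration. The labeling of Q1 as least deprived and Q5 as most deprived is an assumption based on the direction of the Townsend index.

```python
from statistics import quantiles

# Age bands from the text: 35-44, 45-54, 55-64 and 65-74 years.
AGE_BANDS = [(35, 44), (45, 54), (55, 64), (65, 74)]


def age_group(age):
    """Return the age band label for an integer age at recruitment, or None if out of range."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    return None


def deprivation_quintile(value, cut_points):
    """Map a Townsend index value to Q1 (assumed least deprived) .. Q5 (most deprived)."""
    # Count how many cut points the value exceeds; a higher index means more deprivation.
    rank = sum(value > cut for cut in cut_points)
    return f"Q{rank + 1}"


# Hypothetical Townsend index values, for illustration only.
townsend = [-5.1, -3.8, -2.9, -2.0, -1.1, 0.2, 1.4, 2.7, 4.3, 6.0]
cuts = quantiles(townsend, n=5)  # four cut points that split the values into fifths
sed_groups = [deprivation_quintile(v, cuts) for v in townsend]
```

With real data, the cut points would be computed once over the full cohort so that each SED group holds roughly one fifth of participants.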

Phenotype case-control cohorts
We used the UKB participants' ICD-10 diagnosis codes taken from UKB Field 41270 to define case-control cohorts using the phecode scheme defined by the PheWAS consortium (13,14). The ICD-10 codes include all distinct diagnosis codes that a participant has recorded across all of their hospital inpatient records, either in the primary or secondary position. ICD-10 codes for closely related diagnoses are aggregated into individual phecodes, each of which represents a single disease or health condition. Each phecode has an inclusion criterion that covers all ICD-10 codes corresponding to a single disease or health condition and an exclusion criterion that eliminates ICD-10 codes corresponding to closely related conditions. This approach enables the delineation of clearly distinct case-control cohorts for each individual disease or health condition in the phecode scheme. Individual phecodes have been manually curated and validated by physicians and experts. Disease cohorts that had <100 cases were excluded from the analysis for privacy reasons. Phecode case-control cohorts were curated for a total of 1147 diseases or health-related conditions after removing phecodes with ICD-10 codes that are suppressed for diseases with <100 cases, are considered contentious or refer to protected characteristics, following UKB governance guidelines (Supplementary Table S2).
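The inclusion and exclusion logic described above can be sketched as follows. This is a simplified illustration, not the phecode map itself: the function names are hypothetical, codes are matched exactly rather than by range, and the sketch assumes that participants who carry only exclusion-criterion codes are dropped from both cases and controls.

```python
MIN_CASES = 100  # cohorts with fewer cases are excluded, per the text


def assign_cohort(participant_codes, include, exclude):
    """Classify one participant as 'case', 'control' or 'excluded' for one phecode."""
    codes = set(participant_codes)
    if codes & include:
        return "case"      # at least one ICD-10 code meets the inclusion criterion
    if codes & exclude:
        return "excluded"  # closely related condition: neither case nor control (assumption)
    return "control"


def build_cohort(diagnoses, include, exclude, min_cases=MIN_CASES):
    """Label every participant; return None when the cohort has too few cases."""
    labels = {
        pid: assign_cohort(codes, include, exclude)
        for pid, codes in diagnoses.items()
    }
    n_cases = sum(1 for label in labels.values() if label == "case")
    if n_cases < min_cases:
        return None
    return labels


# Hypothetical participants and code sets, for illustration only.
example_diagnoses = {"p1": ["E11"], "p2": ["E10"], "p3": ["I10"], "p4": []}
example_labels = build_cohort(example_diagnoses, {"E11"}, {"E10"}, min_cases=1)
```

In this toy example, `p1` is a case, `p2` is excluded and `p3` and `p4` are controls; with the default threshold of 100 cases the same cohort would be discarded.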

Disease prevalence and quantifying disparities
The crude prevalence for each of the 1147 diseases was calculated for the overall cohort and for each individual group defined by the population attributes under consideration. We used crude prevalence, without controlling for age and sex, since our disparity browser includes comparisons between groups defined by age and sex. Crude prevalence was calculated as follows:

$$\text{Prevalence} = \frac{N_{\text{cases}}}{N_{\text{cases}} + N_{\text{controls}}} \times 100$$

where $N_{\text{cases}}$ refers to the number of cases and $N_{\text{controls}}$ refers to the number of controls.
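As a small worked example, crude prevalence can be computed from case and control counts. The sketch assumes prevalence is expressed as a percentage of cases among cases plus controls; the function name and the counts are hypothetical.

```python
def crude_prevalence(n_cases, n_controls):
    """Percent prevalence: cases as a share of cases plus controls."""
    total = n_cases + n_controls
    if total == 0:
        raise ValueError("cohort has no cases or controls")
    return 100.0 * n_cases / total


# Hypothetical (cases, controls) counts for one disease in two groups.
group_counts = {"Group A": (250, 750), "Group B": (100, 900)}
group_prevalence = {
    group: crude_prevalence(cases, controls)
    for group, (cases, controls) in group_counts.items()
}
```

No age or sex adjustment is applied here, which matches the use of crude prevalence described in the text.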
For each population attribute under consideration, we calculated the range of prevalence values across the constituent groups as follows:

$$\text{Range difference} = \max(\text{Prevalence}) - \min(\text{Prevalence})$$

along with calculating the ratio of the range of prevalence values as follows:

$$\text{Range ratio} = \frac{\max(\text{Prevalence})}{\min(\text{Prevalence})}$$

Taken together, these two metrics enable the identification of health disparities for high-prevalence diseases (using the Range difference) and for those diseases with low overall prevalence values (using Range ratio). On plotting these two metrics orthogonally, we computed a unified disparity score defined as the Euclidean distance from the origin as follows:

$$\text{Disparity score} = \sqrt{(\text{Range difference})^2 + (\text{Range ratio})^2}$$

Within a population attribute, a relative disease burden was calculated for each group as follows:

$$\text{RDB}_{\text{Group}} = \frac{N_{\text{Group}} - \text{NullAvg}}{\text{NullAvg}}$$

where $\text{RDB}_{\text{Group}}$ refers to group-specific relative disease burden, $N_{\text{Group}}$ refers to the number of phenotypes where the Group has the highest prevalence and NullAvg refers to the null expectation calculated as $1147/k$ ($k$ is the number of groups for that population attribute). An $\text{RDB}_{\text{Group}}$ value of 0 would mean that the Group in question has the highest prevalence for exactly NullAvg diseases. A high positive value would represent a disproportionately high burden of disease for the subpopulation Group, while a negative value would indicate a disproportionately low burden of disease.
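The two range metrics and the combined score can be sketched in a few lines. The sketch assumes the range difference is the maximum minus the minimum group prevalence, the range ratio is the maximum divided by the minimum, and the disparity score is the plain Euclidean distance of that pair from the origin. Any rescaling or transformation applied before the two metrics are combined is not specified here, so none is applied. Function names and prevalence values are hypothetical.

```python
import math


def range_difference(prevalences):
    """Absolute gap between the highest and lowest group prevalence."""
    return max(prevalences) - min(prevalences)


def range_ratio(prevalences):
    """Relative gap between the highest and lowest group prevalence."""
    lowest = min(prevalences)
    if lowest == 0:
        return math.inf  # ratio is undefined when a group has no cases
    return max(prevalences) / lowest


def disparity_score(prevalences):
    """Euclidean distance from the origin in (range difference, range ratio) space."""
    return math.hypot(range_difference(prevalences), range_ratio(prevalences))


# Hypothetical percent prevalence values for one disease across three groups.
example_prevalences = [4.0, 2.5, 1.0]
example_score = disparity_score(example_prevalences)
```

The range difference is most sensitive for common diseases, while the range ratio highlights rare diseases whose prevalence differs by a large factor, which is why the two are combined.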

Interactive web server
Data processing and analysis were done using the Pandas library in Python (15). Plots were made using the ggplot2 library (16) in the R statistical language v3.6.1 (17). The interactive webserver was developed using the Plotly Dash framework (18).

Health disparities across population attributes
Overall, we had information on the following population attributes for 501 117 participants from the UKB: age, country of residence, ethnic group, sex and SED (Table 1). The largest share of our analysis cohort is aged between 55 and 64 years (42.3%), and most participants reside in England (88.7%), identify as belonging to the White ethnic group (94.2%) and are female (54.4%) (Supplementary Table S1). Leveraging the phecode schema (14), which specifies ICD-10 diagnosis codes and inclusion and exclusion criteria for phenotypes, we generated 1147 case-control cohorts. For each of the case-control cohorts, we calculated the prevalence of disease in groups defined by the five population attributes under consideration. Next, health disparities were quantified as the difference and ratio of the range of disease prevalence among groups defined by population attributes under consideration (Figure 1; Supplementary Figures S1-S4). The two metrics employed, range difference and range ratio, were combined into a single, comparable metric by computing the Euclidean distance from the origin in a space parametrized by these two parameters. On comparing different population attributes, we find that ethnic groups show the greatest overall disease disparities (median disparity score: 2.62), followed by age (median disparity score: 1.63), country of residence (median disparity score: 0.86), SED (median disparity score: 0.70) and sex (median disparity score: 0.66) (Figure 2).

Health disparities among groups defined by population attributes
To identify groups with disproportionately high disease prevalence across phenotypes, we quantified the relative disease burden for groups defined by each population attribute (Figure 3). This was done by calculating the deviation from the number of times a group had the highest prevalence of disease phenotypes compared to the null hypothesis of equally distributed disease prevalence. Among the groups defined by age, we find that participants aged between 65 and 74 years had the highest relative burden of disease (1.27), while those aged between 45 and 54 years seemed to have the lowest burden of disease (−0.71) in our analysis cohort. For groups defined by country of residence, those residing in England had the highest relative burden of disease (0.91) and those residing in Scotland had the lowest (−0.51). Those identifying as belonging to the Asian ethnic group had the highest relative burden of disease (0.52), while those identifying as Chinese had the lowest (−0.47). We see that the most socioeconomically deprived quintile of participants (Q5) has the highest relative burden of disease (2.00), while those in the third quintile seem to have the lowest (−0.69). We also find that females have a higher relative burden of disease (0.05) compared to males (−0.05); however, the difference for sex was comparatively small.
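The counting step described above can be sketched as follows. The exact normalization is an assumption: here the deviation of each group's count of highest-prevalence diseases from the null expectation is divided by that expectation, so a value of 0 means the group tops exactly the expected number of diseases. The function name and the prevalence values are hypothetical.

```python
from collections import Counter


def relative_disease_burden(prevalence_by_disease, groups):
    """Relative deviation of each group's 'highest prevalence' count from the null expectation."""
    # Null expectation: every group tops an equal share of the diseases.
    null_avg = len(prevalence_by_disease) / len(groups)
    # For each disease, find the group with the highest prevalence (ties go to the first group).
    top_counts = Counter(
        max(groups, key=lambda g: row[g]) for row in prevalence_by_disease.values()
    )
    # Assumed normalization: (observed count - expected count) / expected count.
    return {g: (top_counts[g] - null_avg) / null_avg for g in groups}


# Hypothetical percent prevalence values for four diseases in two groups.
example_table = {
    "disease 1": {"Female": 3.0, "Male": 2.0},
    "disease 2": {"Female": 1.5, "Male": 1.0},
    "disease 3": {"Female": 0.8, "Male": 0.4},
    "disease 4": {"Female": 2.0, "Male": 5.0},
}
example_burden = relative_disease_burden(example_table, ["Female", "Male"])
```

Under this normalization the values for all groups of one attribute sum to zero, which is consistent with the symmetric values reported for sex (0.05 and −0.05).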
We identified the most disparate disease for groups defined by each population attribute under consideration (Table 2). We find that essential hypertension is a large health disparity across four out of the five population attributes studied here. Type 2 diabetes and hypercholesterolemia also stand out as showing disparate prevalence values across multiple population attribute groups. Prevalence values for each group defined by the different population attributes under consideration, along with the disparity metric, can be accessed using the interactive browser.

Interactive health disparities browser
The interactive browser was developed using the Model-View-Controller software design paradigm (19), which divides the program logic into three interconnected elements: the 'Model', the 'View' and the 'Controller'. This separation allows for easier management of the front-end and back-end components of the browser. In the Model-View-Controller framework, the 'Model' represents the data structures and databases that are queried, the 'View' represents the user interface and the 'Controller' represents the mediator between these two components (Figure 4).
The browser allows researchers to identify health disparities among groups based on the population attribute of their choice. The browser displays disease prevalence values for each group defined using the chosen population attribute, sorted by the disparity score (Figure 5A). A second table helps users select disease phenotypes by prevalence in groups (Figure 5B). The tables with information on disease prevalence can be sorted using any of their columns and also allow for keyword searches. The summary statistics data underlying the browser can be accessed from the GitHub repository: https://github.com/healthdisparities/UKB-Disparity-Atlas.
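The sort and search behavior of the tables can be illustrated with a minimal controller-style function. This is a plain-Python sketch, not the browser's Plotly Dash code: the function name `query_table`, the column names and all numeric values are hypothetical placeholders.

```python
def query_table(rows, keyword="", sort_by="disparity_score", descending=True):
    """Filter rows by a keyword in the disease name or code, then sort by one column."""
    needle = keyword.lower()
    matches = [
        row for row in rows
        if needle in row["disease"].lower() or needle in row["phecode"].lower()
    ]
    return sorted(matches, key=lambda row: row[sort_by], reverse=descending)


# Placeholder rows standing in for the 'Model'; the numbers are not real results.
example_rows = [
    {"phecode": "401.1", "disease": "Essential hypertension", "disparity_score": 3.0},
    {"phecode": "250.2", "disease": "Type 2 diabetes", "disparity_score": 5.0},
    {"phecode": "272.11", "disease": "Hypercholesterolemia", "disparity_score": 1.0},
]

by_score = query_table(example_rows)                      # default: highest score first
diabetes_only = query_table(example_rows, keyword="diab")  # keyword search on names
```

In Model-View-Controller terms, `example_rows` plays the role of the Model, `query_table` acts as the Controller and whatever renders the returned rows would be the View.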

Discussion
Here, we describe the landscape of health disparities in the UKB participant cohort. We find marked disparities in disease prevalence for UKB participants defined by age, country of residence, ethnic group, sex and SED. Overall, ethnicity has the greatest effect on disease disparities, with the Asian group (Bangladeshi, Indian and Pakistani) showing the highest levels of disease prevalence and the Chinese group showing the lowest levels of disease prevalence. Coronary atherosclerosis and non-specific chest pain were detected as disparities specific to the Asian group. Sickle cell anemia and uterine leiomyoma were detected as disparities specific to the Black ethnic group; melanomas of the skin and diverticulosis showed relatively high prevalence in the White ethnic group. Older age and high SED were both associated with a relatively high burden of disease, as expected. England showed a relatively high burden of disease compared to Wales and Scotland, with Scotland showing the lowest country-specific disease burden. This may be attributable to higher SED for participants recruited from England compared to those from Wales and Scotland. Sex shows the lowest overall levels of disease disparities, and the largest disease disparities for this attribute are seen for sex-specific conditions, such as prostate cancer and uterine leiomyoma, as can be expected. Essential hypertension, hypercholesterolemia and type 2 diabetes show high prevalence differences across most population attributes, whereas mental disorders show disparities for country of residence and SED.
Sampling bias represents one potential limitation of this study. UKB participants are generally healthier and wealthier than the general population, and this 'healthy volunteer' bias could affect the disease prevalence and disparity estimates reported here (20). Thus, the external validity of the results reported here, with respect to their correspondence to the general UK population, may vary by disease and population group. Notwithstanding this caveat, the health disparities landscape browser developed here should serve as a useful resource to guide follow-up studies of both the UKB cohort and the general UK population.