Inconsistency in UK Biobank Event Definitions From Different Data Sources and Its Impact on Bias and Generalizability: A Case Study of Venous Thromboembolism

Abstract The UK Biobank study contains several sources of diagnostic data, including hospital inpatient data and data on self-reported conditions for approximately 500,000 participants and primary-care data for approximately 177,000 participants (35%). Epidemiologic investigations require a primary disease definition, but whether to combine data sources to maximize statistical power or focus on only 1 source to ensure a consistent outcome is not clear. The consistency of disease definitions was investigated for venous thromboembolism (VTE) by evaluating overlap when defining cases from 3 sources: hospital inpatient data, primary-care reports, and self-reported questionnaires. VTE cases showed little overlap between data sources, with only 6% of reported events for persons with primary-care data being identified by all 3 sources (hospital, primary-care, and self-reports), while 71% appeared in only 1 source. Deep vein thrombosis–only events represented 68% of self-reported VTE cases and 36% of hospital-reported VTE cases, while pulmonary embolism–only events represented 20% of self-reported VTE cases and 50% of hospital-reported VTE cases. Additionally, different distributions of sociodemographic characteristics were observed; for example, patients in 46% of hospital-reported VTE cases were female, compared with 58% of self-reported VTE cases. These results illustrate how seemingly neutral decisions taken to improve data quality can affect the representativeness of a data set.

Our aim in this investigation is to determine how using different sources of data may affect VTE case populations within the UK Biobank. We will do this by considering how closely reports of VTE from different data sources correspond and whether the populations reported as cases are similar. We will not consider any specific reporting method as a "gold standard" of truth to determine the accuracy of other methods, nor will we attempt to estimate VTE incidence in the general UK population. Instead, we will compare how similar each definition is to the others within the UK Biobank.

Study participants
The UK Biobank is a large prospective cohort study containing diagnostic data for 503,317 participants, aged 37-73 years, who were recruited across England, Scotland, and Wales between 2006 and 2010.

Data sources within the UK Biobank
Self-reported outcomes.
At enrollment and resurvey, participants answered a touch-screen questionnaire, including specific questions about prior physician diagnoses of blood clots in the leg or lungs as well as more general questions about serious medical conditions. These were followed up with a verbal interview (36). Where participants were not certain about prior diagnoses, their responses were matched where possible to health conditions in a coding tree by a medical professional (37). Self-reported VTEs were coded as DVT, PE, or other VTE.
Hospital data.

International Classification of Diseases, Ninth Revision, and International Classification of Diseases, Tenth Revision, coded hospital inpatient episodes were obtained from the Hospital Episode Statistics provider for England, the Patient Episode Data for Wales, and the Scottish Morbidity Records for Scotland (38). These data sets contain information on admission and discharge, operations, diagnoses, maternity care, and psychiatric care. Main and secondary diagnoses throughout the patient's admission are recorded. These data are only available within the UK Biobank for patients who are admitted to the hospital and occupy a bed.

Death certificate data.
International Classification of Diseases, Tenth Revision, coded national death registry data were obtained from the Health and Social Care Information Centre (now NHS England) for England and Wales, and the Information Services Department (ISD) for Scotland (39). These data include primary and secondary causes of death determined by a physician who attended the patient in their last illness or a coroner (40).

Primary-care records.
Primary-care data were captured for 230,000 participants, covering records from selected general practice services in England, Scotland, and Wales (41). We took a subset of 177,363 participants that ensured continuous coverage overlapping with their recruitment into the UK Biobank. Details on choices made can be found in Web Appendix 1.

Event definitions
VTE cases were determined using the 4 data sources: death certificate data, hospital data, self-reported outcomes, and primary-care records (in the primary-care cohort only). We considered events reported by each data source in turn, as well as a combined outcome including events reported by any data source.

VTE cases were broken down into PE and DVT via matching to any of the codes in Web Table 1. If a participant matched to VTE but not to PE or DVT, they were classed as "other VTE."
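The subdiagnosis rule described above can be sketched in a few lines of Python. This is only an illustration: the ICD-10 prefixes below are assumptions chosen for the example, not the study's code lists (those are in Web Table 1), and the function name `classify_vte` is hypothetical.

```python
from typing import Iterable, Optional

# Illustrative ICD-10 prefixes only (assumption); the study's lists are in Web Table 1.
PE_PREFIXES = ("I26",)            # pulmonary embolism
DVT_PREFIXES = ("I801", "I802")   # thrombosis of deep veins of the lower extremities
VTE_PREFIXES = PE_PREFIXES + DVT_PREFIXES + ("I82",)  # plus other venous embolism


def classify_vte(codes: Iterable[str]) -> Optional[str]:
    """Classify one participant's diagnosis codes into a VTE subdiagnosis.

    Returns 'PE and DVT', 'PE only', 'DVT only', 'other VTE',
    or None when no code matches any VTE prefix.
    """
    # Normalize 'I26.9' and 'i269' to the same form before prefix matching.
    normalized = [code.replace(".", "").upper() for code in codes]
    has_pe = any(code.startswith(PE_PREFIXES) for code in normalized)
    has_dvt = any(code.startswith(DVT_PREFIXES) for code in normalized)
    has_vte = any(code.startswith(VTE_PREFIXES) for code in normalized)

    if has_pe and has_dvt:
        return "PE and DVT"
    if has_pe:
        return "PE only"
    if has_dvt:
        return "DVT only"
    if has_vte:
        # Matched a VTE code but neither PE nor DVT.
        return "other VTE"
    return None
```

For example, `classify_vte(["I82.9"])` falls through to the "other VTE" branch because the code matches a VTE prefix but neither the PE nor the DVT prefixes.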

Medication use
One concern about the self-reported outcomes is that case definitions may be much less accurate. To investigate this concern, we considered whether patterns of relevant medication use were similar between the cases reported via different sources.
VTEs are often treated with anticoagulants. While warfarin is not recommended as the first-line treatment in the current UK guidelines, the standard of care prior to 2020 was a low-molecular-weight heparin bridge followed by warfarin (34).
There are 2 sources of general medication usage data within the UK Biobank. One is self-reported data, collecting lists of all regularly taken prescription medications during the touch-screen questionnaire and verbal interview (data-field 20003) (36). The other source is the primary-care records prescription data, which are available only for the cohort with primary-care data (data-field 42039). Matching on drug names was undertaken to identify participants who had taken either any anticoagulant or warfarin at some point (details of matching are shown in Web Table 2).
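A minimal sketch of drug-name matching of this kind is given below. The list of drug names is an assumption made for illustration (the terms actually matched are listed in Web Table 2), and `medication_flags` is a hypothetical helper, not part of any UK Biobank tooling.

```python
import re
from typing import Dict, Iterable

# Illustrative anticoagulant names only (assumption); see Web Table 2 for the study's terms.
ANTICOAGULANT_NAMES = ("warfarin", "heparin", "apixaban", "rivaroxaban", "dabigatran")


def _mentions(name: str, text: str) -> bool:
    """True if `name` occurs as a whole word in `text` (case already lowered)."""
    return re.search(rf"\b{re.escape(name)}\b", text) is not None


def medication_flags(entries: Iterable[str]) -> Dict[str, bool]:
    """Flag any-anticoagulant and warfarin use from free-text medication entries."""
    text = " ".join(entries).lower()
    return {
        "any_anticoagulant": any(_mentions(name, text) for name in ANTICOAGULANT_NAMES),
        "warfarin": _mentions("warfarin", text),
    }
```

Whole-word matching is used here so that a drug name embedded in a longer, unrelated token does not count as a match; a real matching scheme would also need to handle brand names and spelling variants.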

Statistical methods
We cross-tabulated the events in both the full UK Biobank sample and the cohort with available primary-care data. In both groups, we compared anticoagulant medication use and demographic data between the case groups defined by the various data sources. The variables we compared were age at baseline, sex, smoking status, ethnicity, body mass index (weight (kg)/height (m)²), current employment status, highest level of education, history of manual or shift work, Townsend deprivation index (a greater score reflects more deprivation), house ownership, and car ownership. For participants in England, we also looked at the Index of Multiple Deprivation and the scores that determine the Index of Multiple Deprivation.
Proportional Venn diagrams were plotted to obtain a visual understanding of the various overlaps of cases. Agreement between the methods was evaluated using the Fleiss κ value between all of the sets and Cohen's κ pairwise between each method. Kappa coefficients less than 0.6 are taken to indicate inadequate agreement, in line with recommendations for health-related studies; those of 0.6-0.8 are taken as moderate agreement, 0.8-0.9 as strong agreement, and greater than 0.9 as almost perfect agreement (42).
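For readers unfamiliar with the statistic, pairwise Cohen's κ for 2 binary case definitions can be sketched as follows. This is a generic textbook calculation on toy inputs, not the analysis code used in the study; the function names are hypothetical, and the treatment of values falling exactly on a band boundary in `interpret_kappa` is an assumption.

```python
from typing import Sequence


def cohen_kappa(source_a: Sequence[int], source_b: Sequence[int]) -> float:
    """Cohen's kappa for two equal-length sequences of 0/1 case flags."""
    n = len(source_a)
    if n == 0 or n != len(source_b):
        raise ValueError("inputs must be non-empty and of equal length")

    # Observed agreement: share of participants classified the same way by both sources.
    observed = sum(a == b for a, b in zip(source_a, source_b)) / n

    # Agreement expected by chance, from each source's marginal case rate.
    rate_a = sum(source_a) / n
    rate_b = sum(source_b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)

    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa: float) -> str:
    """Map kappa to the agreement bands described in the text (boundary handling assumed)."""
    if kappa < 0.6:
        return "inadequate"
    if kappa < 0.8:
        return "moderate"
    if kappa <= 0.9:
        return "strong"
    return "almost perfect"
```

With 4 toy participants flagged `[1, 1, 0, 0]` by one source and `[1, 0, 0, 0]` by the other, observed agreement is 0.75 and chance agreement is 0.5, giving κ = 0.5.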

Study population
Table 1 contains a summary of the overall demographic characteristics of the UK Biobank participants. The defined primary-care cohort reproduced the known biases within the UK Biobank data set: there was a "healthy volunteer bias," with participants being more likely to be older, to be female, to have a lower body mass index, to smoke less, to live in less socioeconomically deprived areas, and to have a greater rate of higher education than the average person in the United Kingdom (43) (Web Table 3). The primary-care cohort had a similar sex and age balance, a slightly greater proportion of White participants, higher rates of unemployment, and lower rates of higher education than the full data set.

Event definitions in the full UK Biobank sample
No single data source captured all VTE cases, and the percentage captured by different methods varied by case type. Taking a report from any source as a case and breaking cases down by subdiagnosis, 13% of participants had both PE and DVT, 54% had DVT only, 30% had PE only, and 3% had a VTE that fit into neither category.
There was little agreement between self-report data and inpatient hospital data (κ = 0.32). Only 20.2% of VTE cases were reported by both sources (Figure 1), while 79.8% appeared only in a single source (51.5% appeared only in self-reports and 28.3% appeared only in hospital data). There was a larger overlap of hospital events being self-reported when we restricted the data to prevalent events, but we did not see a matching hospital report for the majority of incident self-reported events (see Web Figure 1) (available at https://doi.org/10.1093/aje/kwad232).
The data from death certificates did not add any additional clarity (Web Tables 4 and 5, Web Figures 2 and 3). There were 741 cases of VTE identified from primary and secondary causes of death, of which 388 cases did not appear in another data source. Due to the small proportion of cases identified through this method (approximately 4%), we did not analyze death certificate data further.
Considering the 2 subdiagnoses (DVT and PE), there was a difference in the source of the report by subdiagnosis: More DVTs were only self-reported (67.6%) than were in hospital records only (16.3%) or both in the hospital records and self-reported (16.1%), whereas PEs were most likely to be in hospital records only (46.9%), although nearly a third appeared only as self-reports (32.3%).
The proportion of DVT to PE events also varied with the data source (Figure 2). If we considered only hospital data, 50% of events were PE only, 36% were DVT only, and 9% were both, whereas when taking self-reported outcomes as the data source, 20% of events were PE only, 68% were DVT only, and 11% were both.
We also saw variation in the demographic characteristics of the identified cases (Table 2). For example, using only hospital data, the case population was 45.7% female, while using the self-reported data the case population was 58.3% female.
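Overlap percentages of the kind reported in this section (cases seen in one source only versus in several) can be derived from sets of participant identifiers, as in the sketch below. The identifiers and source names are toy values invented for the example, and `overlap_summary` is a hypothetical helper rather than the study's code.

```python
from collections import Counter
from typing import Dict, FrozenSet, Hashable, Set


def overlap_summary(sources: Dict[str, Set[Hashable]]) -> Dict[FrozenSet[str], float]:
    """Percentage of all cases falling in each exact combination of sources.

    `sources` maps a source name to the set of participant IDs it reports as cases.
    A case is any participant reported by at least one source.
    """
    all_cases = set().union(*sources.values())
    combos: Counter = Counter()
    for participant in all_cases:
        # The exact set of sources that report this participant as a case.
        reported_by = frozenset(
            name for name, ids in sources.items() if participant in ids
        )
        combos[reported_by] += 1
    total = len(all_cases)
    return {combo: 100 * count / total for combo, count in combos.items()}


# Toy data (invented): 10 cases in total across 2 sources.
toy = {
    "hospital": {1, 2, 3, 4},
    "self_report": {3, 4, 5, 6, 7, 8, 9, 10},
}
toy_summary = overlap_summary(toy)
```

Keying the result by the exact combination of sources means the same function works unchanged for 3 or more sources, which is the form needed for the primary-care cohort.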

Event definitions in primary-care cohort
The primary-care cohort within the UK Biobank showed patterns of case overlap similar to those of the full participant group (Figure 3). Adding the additional cases from the primary-care records did not explain many of the undocumented self-reported VTE events and added an additional set of otherwise uncaptured outcomes.
The highest agreement was between hospital data and self-reported data (κ = 0.33), but this was still inadequate in terms of concordance. Primary-care data had slightly more agreement with hospital data than with self-reported data (κ = 0.29 vs. 0.21). Only 5.5% of VTE cases were reported by all 3 sources, while 71.3% appeared only in a single source: 43.9% appeared only in self-reports, 21.8% appeared only in hospital data, and 5.6% appeared only in primary-care reports. When the data were split into periods prior to registration and postregistration, there was a clear time-period effect due to the lack of self-reports postregistration for many participants and the sparsity of hospital reports prior to registration. In all cases, the primary-care data and the hospital data had little overlap (Web Figures 4 and 5, Web Tables 6-8).
There was a difference in the source of the report for the subdiagnoses: Most DVTs were only self-reported (60.4%), while more PEs were in hospital records only (36.1%) than in any other category. There was slightly better agreement between sources for PE (κ = 0.33-0.35 when comparing hospital data, primary-care data, and self-reports) than for DVT (κ = 0.14-0.27) (see Web Table 9).

Anticoagulant usage in the primary-care cohort
There were different patterns of reported anticoagulant use between the different case groups, but all had much higher rates than the control group (Web Table 10). Patients whose VTE was identified using only hospital data were more likely to have a record of anticoagulant drug use at some point in their primary-care records (64.7% used some sort of anticoagulant; 50.4% were on warfarin), whereas those identified via primary-care records only had much lower use (37.9% and 26.8%, respectively). Self-reported-only cases fell between these 2 groups (51.2% and 33.2%, respectively). In contrast, anticoagulant drug use among controls (i.e., individuals with no reported VTE event from any source) was much lower (18.9% and 2.5%, respectively). This provided an indication that there were likely to be true VTE events among the self-reported-only cases. Self-reported rates of anticoagulant use were much lower but more consistent between definitions (Web Figure 6).

Differences in sociodemographic characteristics between cases from each data source
The self-reported cases were younger and more likely to be female than the hospital data cases. They were more likely to have been assessed at the UK Biobank centers in Wales and less likely to have been assessed in Scotland. There were also differences between these 2 case groups in terms of mean body mass index, house ownership, and multiple car ownership. The cases identified by primary-care data were somewhere between the other 2 case groups in terms of both sex and age, with lower levels of deprivation and higher rates of house and multiple car ownership.

DISCUSSION
Our investigation found that using different data sources in the UK Biobank results in substantial differences in the number, balance, and sociodemographic characteristics of VTE cases considered. None of the data sources had good agreement with each other. The majority of DVT events appeared only as self-reported outcomes. For PE, the largest group of events was reports from hospital data only. One likely reason for this is severity, with DVTs being more likely to be treated in outpatient settings (35) while PE is more often life-threatening, resulting in hospitalization. Hospital reports constituted the majority of postregistration events in the study, while the majority of events prior to registration were self-reported. However, this was not a pure effect of time period, as there were self-reports after registration that were not seen in hospital records and hospital reports before registration that were not self-reported. For both diseases, only a small proportion of participants were detected by multiple data sources as having an event. This suggests a need to be attentive to how use of different data sources may influence case definition and composition.
Large studies of patient characteristics affect our perception of diseases: For example, studies claiming that VTE predominantly affects male or female patients probably affect physicians' perceptions of reported symptoms, as has been seen for cardiovascular disease (44) and depression (45). This can affect how readily they diagnose future patients. As a result, decisions drawn from biased data can lead to greater health inequality, as has previously been observed for algorithmic decisions (46, 47). Future studies can also be biased by these perceptions, with well-meaning and seemingly neutral decisions taken to improve data quality affecting the representativeness of subsequent research findings using the same case definitions.

Accuracy of self-reported data for determining health outcomes
Self-reported outcome data are often viewed unfavorably compared with hospital-reported or physician-collected data. However, several studies considering the accuracy of self-reporting of VTEs in comparison with physician-collected data have found little to substantiate this view. Heckbert et al. (48) evaluated the agreement between self-reports and hospital discharge codes for 99,500 participant reports in the Women's Health Initiative. The concordance between self-reported and hospital-reported events was good (κ = 0.67 for PE and κ = 0.71 for DVT). However, both self-reported and hospital-reported events had higher concordance with physician-adjudicated events for PE (κ = 0.83 and κ = 0.84, respectively) and for DVT (κ = 0.72 and κ = 0.80) (48). These are much higher levels of agreement than we saw in the UK Biobank, which may be because participants were asked specifically about PE and DVT, whereas the UK Biobank asked an open question about physician-diagnosed conditions. Another possibility is that the low overlap reflects the fact that the self-reports mostly refer to events occurring prior to registration, while the bulk of the hospital data were collected after registration. Several much smaller studies have found similar concordances. Frezzato et al. (49) demonstrated that the question, "Do you think you ever had venous thromboembolism?" had a sensitivity of 84% and a specificity of 88% compared with medical records among 267 Italian participants. Greenbaum et al. (50) found an 88.9% positive predictive value for PE and a 69.7% positive predictive value for DVT when comparing self-reports with surgeons' assessments in a US cohort of 3,976 postsurgery patients. This leads us to conclude that there is no strong inherent reason to disregard the self-reported data on VTEs as less accurate than the medical reports.
There is also a considerable body of literature on potential sources of bias in externally validated data. One concern is informed presence bias, which is influenced by socioeconomic factors such as health-care costs (51), educational level, and distance from health-care services (52). Perceptions about the health-care system can affect patients' willingness to self-report outcomes (53). Poor communication between patient and clinician could also be a factor in discordance, as more complicated conditions are both harder to diagnose (and thus underreported in medical records) and harder for the patient to understand (and thus misreported or underreported in self-reported data). This might explain why patients are more likely to self-report DVTs, a more commonly understood illness than PE. These factors mean that 2 patients with identical symptoms and underlying conditions may be represented differently in different data sources.

Biases affecting VTE reporting
We found that changing the data source for defining a VTE outcome from hospital data to self-reported data altered the sociodemographic characteristics of cases under consideration. There was also noticeable variation between VTE case proportions by assessment center; this could have been due to underlying geographic variation in NHS provision (54). Self-reported cases were younger and mostly female, while hospital-defined cases were mostly male. This is particularly salient given conflicting evidence on whether VTEs are more prevalent in male or female patients and the impact this perception might have on subsequent diagnoses.
Several different factors might explain why women are more likely to self-report a disease without an equivalent medical record. Previous investigations have suggested age as a potential reason for differences in case prevalence, and self-reported events are mostly captured prior to registration (2, 3, 20). Prevalent events are subject to survivorship bias, but it is unclear why this would induce a gendered difference. We observed a difference in the mean age of cases between self-reported and hospital data; the absolute difference was 0.6 years. However, this difference is unlikely to explain such a large discrepancy in sex rates. It is also possible that this discrepancy is a result of sex bias in diagnosis. Diagnostic and treatment bias according to sociodemographic factors is well-documented for cardiovascular diseases. Worldwide, women are less likely to undergo a detailed risk factor assessment for cardiovascular disease even when physicians are presented with identical symptoms (55, 56) and are more likely to be misdiagnosed (57, 58) or to have their symptoms dismissed as psychogenic (59). Women with cardiovascular symptoms are less likely than men to be referred to a specialist (60, 61) and to receive advanced diagnostics (62), coronary procedures (63, 64), and appropriate drug treatment (65-67). It is unclear to what extent this generalizes specifically to VTEs: One study of DVT events found more women than men were sent for a diagnostic workup for DVT, but the actual diagnosis of DVT was higher in men with more severe thrombotic events (68). However, women have poorer quality-of-life outcomes 1 year after diagnosis (69), worse bleeding outcomes, and more VTE mortality in long-term follow-up (70). Given this, the magnitude of the impact of sex bias on VTE reporting is uncertain.

Strengths and weaknesses of study
The strengths of our study include the large sample size. The previously largest study comparing self-reported VTEs with hospital reports was the Women's Health Initiative study. This study was one-fifth the size of the UK Biobank study, and only outcomes in women were considered (48). In other comparisons, either study populations have been much smaller (22, 49, 71) or investigators have evaluated a more general cardiovascular outcome (72-75). Our results are also strengthened by the robust data linkage between the self-reported and hospital data. We know that the same participants appeared in both data sets; thus, differences in prevalence were not due to the biases of different samples but due to how well the different sources captured case numbers. Use of the linked primary-care data allowed us to investigate whether primary-care diagnosis could explain why so many participants only self-report VTE events. A further strength is the comparison of DVT and PE events. Because these events are caused by related biological mechanisms, differences in diagnosis and reporting patterns between the diseases will more strongly represent differences arising from social, behavioral, and clinical factors.
Weaknesses of our study include the fact that the UK Biobank is not representative of the UK population. This nonrepresentativeness limits our ability to extend conclusions beyond this data set, and none of the figures given here should be used as accurate estimates of the prevalence of VTE in the UK population. Nevertheless, the UK Biobank has a large influence on health research and perceptions about medical conditions worldwide. As such, it is vital to identify potential biases that may be introduced in considering particular data sources within the UK Biobank.
Another weakness is that the choices made in defining our primary-care cohort may have introduced additional bias. The primary-care cohort appears to be reasonably representative of the UK Biobank population in most characteristics but is distributed differently geographically. We acknowledge both the reproduction of the original biases of the UK Biobank and the possible intensification of them.
It is unclear whether these patterns found for VTEs in the UK Biobank would be similar for other conditions. Studies have found that more widely recognized and easily diagnosed illnesses tend to have greater agreement between self-reports and official records (74, 75), that community-managed conditions are less likely to be reflected in hospital records (73), and that more serious diseases have higher agreement between sources (76). However, there is not much consistency in how accuracy is reported between these studies, and it is difficult to form a conclusion about whether a specific disease will have a strong overlap between hospital records and self-report. We would expect, in line with the previous literature, that these differences will be less marked for more common and more well-known diseases and for diseases for which there are empirical tests for diagnosis; this is reflected in our findings for DVT and PE.

Recommendations
For studying VTEs, and in general, we recommend that researchers look at the reports coming from all the possible sources within the UK Biobank, how the reports overlap, and whether there are any clues in the medication or demographic data that may help them identify the most appropriate definition to use. Self-reported data are particularly useful for identifying cases before baseline, while hospital data will capture more incident events. Inclusion of primary-care data may result in an analysis with less power due to the smaller number of participants with available data, but it has the potential to capture events rarely seen in hospital, such as depression. While primary-care data were not useful in validating self-reported events in the case of VTEs, there may be conditions where there is a much larger overlap between sources of report, in which case the self-reported data could be used as a proxy for the missing primary-care data in the full cohort. We would recommend that self-reported data be included for case definitions of VTE, either as sensitivity analysis alongside a more parsimonious main definition or as the primary analysis together with a sensitivity analysis that excludes the self-report data. This gives the researcher the greatest flexibility for understanding the impact this decision might have on their analysis.
In conclusion, there are large differences between the VTE case populations defined based on routinely collected hospital data and those defined based on self-reported data in the UK Biobank, in terms of both the number of events reported and the demographic characteristics of the case populations. Such differences are likely to affect our perception of the typical VTE patient. As such, our findings suggest that in future studies, researchers need to be aware of potential demographic differences underlying seemingly neutral event definitions in order to avoid entrenching further inequalities in health care.
Abbreviation: SD, standard deviation.
a Answered "usually" or "always" to the question, "Does your work involve heavy manual or physical work?"
b Answered "usually" or "always" to the question, "Does your work involve heavy manual or physical work?"

Figure 3. Proportional overlap in venous

Table 2. Demographic Comparison Between Case Populations Defined via Different Data Sources, UK Biobank, 2006-2010 a
a A larger version of this table can be seen as Web Table 11. P values are reported for independent sample t-tests for continuous variables and χ² a