Quality of ethnicity data within Scottish health records and implications of misclassification for ethnic inequalities in severe COVID-19: a national linked data study

Abstract Background We compared the quality of ethnicity coding within the Public Health Scotland Ethnicity Look-up (PHS-EL) dataset, and other National Health Service datasets, with the 2011 Scottish Census. Methods Measures of quality included the level of missingness and misclassification. We examined the impact of misclassification using Cox proportional hazards models to compare the risk of severe coronavirus disease (COVID-19) (hospitalization or death) by ethnic group. Results Misclassification within PHS-EL was higher for all minority ethnic groups [12.5 to 69.1%] compared with the White Scottish majority [5.1%] and highest in the White Gypsy/Traveller group [69.1%]. Missingness in PHS-EL was highest among the White Other British group [39%] and lowest among the Pakistani group [17%]. PHS-EL data often underestimated severe COVID-19 risk compared with Census data. For example, in the White Gypsy/Traveller group the hazard ratio (HR) was 1.68 [95% confidence interval (CI): 1.03, 2.74] compared with the White Scottish majority using Census ethnicity data and 0.73 [95% CI: 0.10, 5.15] using PHS-EL data; and the HR was 2.03 [95% CI: 1.20, 3.44] in the Census for the Bangladeshi group versus 1.45 [95% CI: 0.75, 2.78] in PHS-EL. Conclusions Poor quality ethnicity coding in health records can bias estimates, thereby threatening the monitoring and understanding of ethnic inequalities in health.


Introduction
Ethnic inequalities in health have received heightened attention since the start of the coronavirus disease (COVID-19) pandemic, which has disproportionately impacted some minority ethnic groups. 1,2 The availability of high-quality ethnicity data alongside health records is crucial to monitoring, understanding and redressing these inequalities. 3 However, ethnicity data from UK health records have until recently been of limited use owing to poor completeness and quality. 4 While there have been improvements, issues remain such as the inconsistent use of codes and high proportions of 'not known', 'not stated' or 'other' codes being used. 5,6 For example, a study assessing the risk factors for severe COVID-19 early on in the pandemic in Scotland found that 15-26% of hospitalized individuals did not have their ethnicity recorded in NHS datasets. 7 Such under-ascertainment can lead to the aggregation of heterogeneous ethnic groups, limited statistical power and the inability to effectively monitor ethnic inequalities in health. 7][10] Fixed-response categories commonly used to record ethnicity have been critiqued on these grounds, as has the aggregation of these categories in research. 8,11 Nevertheless, fixed-response categories that are self-reported are generally viewed as preferable to those that are not (e.g. being ascribed by healthcare workers instead) for monitoring inequalities and quantitative research. 4 As the Census in the UK adopts self-reported categories and offers almost complete population coverage, it is often considered the 'gold standard' and provides the best available ethnicity data. 12 In contrast, health records are often incomplete and healthcare workers are not always aware that ethnicity should be self-reported.
This study aimed to (i) assess the quality of ethnicity coding within Scottish health datasets compared with the 2011 Scottish Census as the 'gold standard' and (ii) understand how differences in quality impact the observation of ethnic inequalities in severe COVID-19.

Study design
We used a cross-sectional analysis to assess the quality of ethnicity coding in Scottish health records compared with the 'gold standard' 2011 Scottish Census. We then used a population-based cohort analysis to explore the implications of misclassified ethnicity for assessing ethnic inequalities in severe COVID-19 (hospitalization or death).

Study population and inclusion criteria
Individuals included were aged ≥16 years, present in both the Community Health Index (CHI) register and the 2011 Census, and residing in Scotland on 1 March 2020, the day of the first laboratory confirmed case of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in Scotland. 13 CHI provides a unique numerical identifier for all registered patients in Scotland.

Data
We used data from the Early Pandemic Evaluation and Enhanced Surveillance of COVID-19 (EAVE II) platform linked to data from the 2011 Scottish Census. 14 The EAVE II study includes data for around 99% of the Scottish population (described in detail elsewhere). 15 We used the following datasets: Public Health Scotland Ethnicity Look-up (PHS-EL), Electronic Communication of Surveillance in Scotland for SARS-CoV-2 testing data, Scottish Morbidity Record 01 (SMR-01) for hospitalizations, National Records of Scotland death registry (NRS deaths) and Accident & Emergency (A&E) services. PHS-EL was created during the pandemic to improve ethnicity information for monitoring and surveillance purposes. It includes the most recent ethnicity code from numerous National Health Service (NHS) Scotland datasets (Appendix 1), prior to 24 January 2022, and at the time of analysis represented the best available ethnicity information from NHS Scotland.
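
As a rough illustration of how a look-up of this kind might be assembled, the sketch below keeps, for each individual, the most recent ethnicity code recorded before the cut-off date. The record layout, the identifiers and the function name `latest_ethnicity_code` are hypothetical, and skipping empty codes is an assumption; the actual construction of PHS-EL is described in Appendix 1 and may differ.

```python
from datetime import date

# Cut-off taken from the text: codes recorded prior to 24 January 2022.
CUTOFF = date(2022, 1, 24)

def latest_ethnicity_code(records, cutoff=CUTOFF):
    """Return {person_id: most recent ethnicity code recorded before cutoff}.

    `records` is an iterable of (person_id, recorded_on, code) tuples;
    this layout is an assumption made for illustration only.
    """
    latest = {}
    for person_id, recorded_on, code in records:
        if recorded_on >= cutoff or code is None:
            continue  # ignore codes on/after the cut-off and empty codes
        best = latest.get(person_id)
        if best is None or recorded_on > best[0]:
            latest[person_id] = (recorded_on, code)
    return {pid: code for pid, (_, code) in latest.items()}

# Hypothetical records drawn from different source datasets.
records = [
    ("A", date(2015, 5, 1), "White Scottish"),
    ("A", date(2021, 3, 9), "White Other British"),
    ("B", date(2022, 2, 1), "Pakistani"),  # after the cut-off, ignored
    ("C", date(2019, 7, 4), None),         # no code recorded
]
lookup = latest_ethnicity_code(records)
```

In this toy example only individual "A" receives a code; "B" and "C" would count as missing when the look-up is compared against the Census.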

Ethnicity
Ethnicity was classified using 16 categories from the 2011 Scottish Census, which were then aggregated to five categories for secondary analyses (Appendix 2); it should be noted, however, that the Caribbean and Black groups were available to us as a combined category despite existing in the Census as distinct categories.
Ethnicity data were complete in PHS-EL and the Census, with NRS having imputed for non-responses for the latter (2.1% for ethnic group in 2011). 16 Cross-sectional analyses of SMR-01, NRS deaths and A&E were restricted to individuals with ethnicity codes in these datasets meeting the above inclusion criteria, as these datasets included 'Missing', 'Unknown' or 'Not provided' codes. The cohort analysis was restricted to individuals with ethnicity codes in PHS-EL meeting the inclusion criteria to provide a one-to-one comparison.

Outcome
Our outcome for the cohort analysis was COVID-19 related hospitalization or death (referred to as severe COVID-19). The former was defined as a COVID-19 International Classification of Diseases (ICD) 10 code (U07.1-7) as the reason for admission (any position), or a reverse transcription polymerase chain reaction (RT-PCR) confirmed positive test for SARS-CoV-2 in the 28 days prior to admission. The latter was defined as either a death where the relevant ICD-10 code was recorded as the primary or secondary cause of death, or any death where the individual had a positive RT-PCR test for SARS-CoV-2 infection in the 28 days prior to death. Confirmed infection was defined as a positive RT-PCR laboratory test result. Severe COVID-19 was chosen as an exemplar outcome to understand the implications of misclassification. As such, this analysis did not provide any definitive exploration of the relationship between ethnicity and severe COVID-19.
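
The outcome definition above can be read as a pair of simple rules. The following is a minimal sketch of those rules, not the study's code: the function names and arguments are hypothetical, and treating the 28-day window as inclusive of both the event day and day 28 is an assumption that the text does not settle.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=28)  # look-back window stated in the text

def positive_test_within_window(event_date, positive_test_dates, window=WINDOW):
    """True if any positive RT-PCR test falls in the 28 days up to the event.

    Assumption: the window includes the event day itself and day 28.
    """
    return any(timedelta(0) <= event_date - t <= window
               for t in positive_test_dates)

def is_covid_hospitalization(admission_date, has_covid_icd10, positive_test_dates):
    """COVID-19 ICD-10 code in any position, or a recent positive test."""
    return has_covid_icd10 or positive_test_within_window(
        admission_date, positive_test_dates)

def is_covid_death(death_date, covid_icd10_as_cause, positive_test_dates):
    """COVID-19 ICD-10 code as a cause of death, or a recent positive test."""
    return covid_icd10_as_cause or positive_test_within_window(
        death_date, positive_test_dates)
```

For instance, an admission on 30 April 2020 with no COVID-19 code but a positive test on 2 April 2020 (28 days earlier) would qualify under this reading, whereas a test on 1 April 2020 would not.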

Statistical analysis
Ethnicity coding in PHS-EL, SMR-01, NRS deaths and A&E was compared with the 2011 Scottish Census. We checked the level of misclassification in each dataset compared with the Census. For the comparison with PHS-EL, we defined missingness as individuals present in the Census and CHI register without an ethnicity code in PHS-EL. We also calculated sensitivity and positive predictive value (PPV) for all comparisons in line with previous validation studies; sensitivity gave the proportion of individuals with a particular code in the Census who had a corresponding code in the comparator dataset, while PPV gave the proportion of individuals with a particular code in the comparator dataset who had a corresponding code in the Census. 17,18 For the purposes of disclosure control, certain data from these analyses are withheld from publication to prevent differencing and low counts.
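
To make these measures concrete, the sketch below computes them for one ethnic group from paired Census and comparator codes. It is illustrative only: the function name, the data layout and the toy data are hypothetical, and using the Census group total as the shared denominator for sensitivity, missingness and misclassification is an assumption (one that appears consistent with the reported figures, where the three proportions sum to roughly 100%).

```python
def quality_measures(pairs, group):
    """Compare a comparator dataset with the Census for one ethnic group.

    `pairs` is a list of (census_code, comparator_code) tuples, one per
    individual; comparator_code is None when no code is recorded.
    """
    in_census = [p for p in pairs if p[0] == group]
    in_comparator = [p for p in pairs if p[1] == group]
    agree = sum(1 for _, comp in in_census if comp == group)
    missing = sum(1 for _, comp in in_census if comp is None)
    misclassified = len(in_census) - agree - missing
    return {
        "sensitivity": agree / len(in_census),
        "ppv": agree / len(in_comparator),
        "missingness": missing / len(in_census),
        "misclassification": misclassified / len(in_census),
    }

# Toy data (not from the study): 10 individuals coded Pakistani in the Census.
pairs = (
    [("Pakistani", "Pakistani")] * 7
    + [("Pakistani", None)] * 2
    + [("Pakistani", "Indian")] * 1
    + [("Indian", "Pakistani")] * 1
    + [("Indian", "Indian")] * 9
)
measures = quality_measures(pairs, "Pakistani")
```

Here 7 of the 10 Census-coded individuals agree (sensitivity 0.7), while 7 of the 8 individuals coded Pakistani in the comparator agree with the Census (PPV 0.875). A production version would also need to guard against groups with no individuals.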
The cohort analysis used Cox proportional hazards models to estimate the risk of severe COVID-19 by ethnic group. All models were adjusted for age (5-year bands) and sex. Using calendar time, we followed individuals from 1 March 2020 (date of first SARS-CoV-2 case) until the first of: experiencing the outcome (hospitalization or death), death from any cause or 1 March 2022. We compared associations between models using ethnicity codes derived from the Census with models using coding from PHS-EL for these individuals. All analyses were conducted in the Scottish National Safe Haven using the R statistical software (Version 4.0.2).
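
The follow-up rule above determines the two quantities a time-to-event model needs for each individual: time at risk and an event indicator. The sketch below derives both under the stated start and end dates; the function name and the tie-breaking rule (an outcome and a death on the same day count as the outcome) are assumptions, and the Cox model itself is not fitted here.

```python
from datetime import date

STUDY_START = date(2020, 3, 1)  # first confirmed SARS-CoV-2 case in Scotland
STUDY_END = date(2022, 3, 1)    # administrative end of follow-up

def follow_up(outcome_date=None, death_date=None):
    """Return (days_at_risk, event) for one individual.

    event is 1 if severe COVID-19 occurred first, else 0 (censored at
    death from any cause or at the end of the study).
    """
    candidates = [(STUDY_END, 0)]
    if death_date is not None:
        candidates.append((death_date, 0))
    if outcome_date is not None:
        candidates.append((outcome_date, 1))
    # Earliest date wins; on ties, prefer the event (e.g. a COVID-19 death).
    end_date, event = min(candidates, key=lambda c: (c[0], -c[1]))
    return (end_date - STUDY_START).days, event
```

An individual with no outcome and no death is censored after 730 days, while one hospitalized with COVID-19 on 10 April 2020 contributes 40 days and an event.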

Patient and public involvement
Patient and public involvement (PPI) was carried out by the EAVE II PPI Coordinator and Public Advisory Group Co-Lead in collaboration with the study team (Appendix 3), helping with the prioritization of research questions and interpretation of findings.

Cross-sectional analysis
A total of 3 776 564 unique individuals met the inclusion criteria for the analysis of PHS-EL. Socio-demographic characteristics for this population are presented in Appendix 4. Analyses of NRS deaths, SMR-01 and A&E concerned 141 726, 482 234 and 250 382 unique individuals, respectively. Owing to small numbers, socio-demographic characteristics for these populations are not presented. When aggregated to five broad categories, missingness was > 22% for all groups (see Table 2). Misclassification was highest among the Mixed or Multiple Ethnicity group [44.2%] and lowest among the White group [0.3%], with these groups being most often misclassified as each other (Table 2 and Appendix 5-C). All other groups were most often misclassified as the White group. Sensitivity and PPV were highest for the White group [Sensitivity = 69.4, 95% CI: 69.3 to 69.4; PPV = 99.6, 95% CI: 99.6 to 99.6] and lowest for the Other Ethnicity group [Sensitivity = 29.8, 95% CI: 28.8 to 30.9; PPV = 26.9, 95% CI: 25.9 to 27.9] (Table 2).

Census versus SMR-01, Census versus A&E and Census versus NRS
When assessing disaggregated groups, sensitivity and PPV were highest among the White Scottish group for all

Main findings of this study
We examined the quality of ethnicity coding within four Scottish health datasets (PHS-EL, NRS, SMR-01 and A&E) compared with the 2011 Scottish Census. Using severe COVID-19 as an example, we highlighted the implications misclassification has for the monitoring of ethnic inequalities in health.
For the main comparison between the 2011 Census and the best available ethnicity information from NHS Scotland (PHS-EL), we found that misclassification was lower in the  and lowest for the White group, with missingness > 22% across all groups.

What is already known on this topic?
9][20] It should be noted, however, that more recent validation studies have focused on aggregated ethnic groups only. 19,20 Relatedly, studies and surveillance reports rarely reported disaggregated ethnicity data early in the COVID-19 pandemic. 21 Aggregating groups can conceal important heterogeneity between ethnic groups; 11 we show that aggregated groups conceal the elevated level of misclassification and risk of severe COVID-19 among the White Gypsy/Traveller group, as the aggregated White group is dominated by the White Scottish majority.
There has also been a lack of recent validity studies both across the UK and within Scotland. 17,20,22 One previous study evaluated the accuracy of a name-based classification system using ethnicity coding from Scottish administrative datasets; however, the quality of coding was not considered, the datasets used were only regional and did not contain data on health outcomes, thus offering limited insight into the implications that ethnicity data quality has for the observation of health inequalities. 17

What this study adds
We provide a population-level evaluation of the quality of ethnicity data in Scottish health datasets and clearly demonstrate how both the quality and granularity of ethnicity coding influence the observation of ethnic inequalities in health. In particular, we demonstrate this to be the case for the White Gypsy/Traveller group, which is important as sizeable health inequalities exist between Gypsy/Traveller and non-Gypsy/Traveller populations in the UK. 23 Additionally, we are the first to show that this minority ethnic group is at an increased risk of severe COVID-19 in Scotland. More generally, we highlight that data linkage may offer improved ethnicity data quality. This approach differs, for example, from addressing data quality issues using name-based classification systems, which are often of questionable validity, not least for individuals of Mixed or Multiple ethnicity. 4,17,22,24

Limitations of this study
We treat the 2011 Census as the 'gold standard' by which the quality of the other datasets is assessed. However, as ethnicity is labile, it is possible that for certain individuals PHS-EL (and other datasets) may provide a more up-to-date representation of an individual's ethnicity. A study in England and Wales found self-reported ethnicity to be stable for 96% of individuals between the 2001 and 2011 Censuses, although stability was lower among ethnic minority groups. 25 Relatedly, it is possible that some individuals may have mistakenly provided the wrong ethnicity in the Census. It is also worth considering that, for the cohort analysis, estimates for certain groups were derived from low counts of both individuals and events. Lastly, we did not examine whether individuals present in the CHI register or PHS-EL but not the Census (who were excluded) differed in their demographic characteristics from those included in our analyses.

Fig. 1 Risk of severe COVID-19 (hospitalization and death) using disaggregated ethnicity coding from the 2011 Scottish Census (first estimate in row) and from the Public Health Scotland ethnicity lookup (PHS-EL) dataset (second estimate in row).

Fig. 2 Risk of severe COVID-19 (hospitalization and death) using aggregated ethnicity coding from the 2011 Scottish Census (first estimate in row) and from the Public Health Scotland ethnicity lookup (PHS-EL) dataset (second estimate in row).

A total of 30% of individuals had missing ethnicity data in PHS-EL, this being highest among the White Other British group [39%] and lowest among the Pakistani group [17%] (Table 1). Overall, 8.5% of individuals in the Census were misclassified by PHS-EL (Table 1). Misclassification was highest for the White Gypsy/Traveller [69.1%; most often misclassified as White Scottish], Other Ethnicity [53.1%; most often misclassified as White Scottish] and Caribbean or Black [49.6%; most often misclassified as Mixed or Multiple Ethnicity] groups and lowest in the White Scottish group [5.1%] (Appendix 5 A and B). Sensitivity was highest for the Pakistani group [68.8%, 95% CI: 68.3 to 69.3] and lowest for the White Gypsy/Traveller group [3.9%, 95% CI: 3.2 to 4.7].

Table 1
Comparison of disaggregated ethnicity coding within 2011 Census to Public Health Scotland ethnicity look-up variable. Due to disclosure control, the proportion (%) of missingness is rounded and the numbers of individuals missing and misclassified are not provided.

Table 2
Comparison of aggregated ethnicity coding within 2011 Census to PHS ethnicity look-up variable