Data Resource Profile: The Education and Child Health Insights from Linked Data (ECHILD) Database

Louise Mc Grath-Lone,* Nicolás Libuy, Katie Harron, Matthew A Jay, Linda Wijlaars, David Etoori, Matthew Lilliman, Ruth Gilbert and Ruth Blackburn
University College London, Institute of Health Informatics, London, UK; Centre for Longitudinal Studies, University College London, Institute of Education, London, UK; and University College London, Great Ormond Street Institute of Child Health, London, UK

The ECHILD Database is de-identified. It does not include any information that could be used to directly identify a person, such as names, addresses, postcodes or dates of birth. Access and outputs are strictly controlled and re-identification of individuals is not permitted. Ethical approval for the ECHILD project was granted by the National Research Ethics Service.
The ECHILD Database was created as part of the ECHILD project, a research study led by University College London in partnership with National Health Service (NHS) Digital and the Department for Education (DfE). The aim of the ECHILD project is to explore the inter-relationship between health and education outcomes for children and young people in England, particularly for vulnerable groups, such as those with adverse birth characteristics and chronic health conditions. For example, we are using the ECHILD Database to explore the impact of disruptions to health and education services during the COVID-19 pandemic on hospital attendances for children and young people in England. However, the ECHILD Database can be used for a wide variety of analyses, provided they benefit education, health and social care. The creation of the ECHILD Database was funded by ADR UK (Administrative Data Research UK), an Economic and Social Research Council (part of UK Research and Innovation) programme.
The ECHILD Database includes all children and young people in England who: (i) were born between 1 September 1995 and 31 August 2020; and (ii) had any record in Hospital Episode Statistics (HES) or the National Pupil Database (NPD). HES and NPD are national administrative datasets covering the whole population of children and young people in England who receive state-funded health, education or social care services, including those who were born outside the country. HES contains records for all hospital activity that is provided or paid for by the NHS in England, including births, inpatient admissions, outpatient appointments and accident and emergency (A&E) attendances. 1 NPD contains records related to state-funded education and use of children's social care services. 2 In total, the ECHILD Database includes linked HES and NPD records for approximately 14.7 million individuals. Information is included, where available, from birth up to an individual's 25th birthday, as this is widely considered the end of adolescence. 3 HES and NPD are well-established datasets that collect and collate information on a regular, ongoing basis to create individual-level, longitudinal records. Data within HES and NPD are organized into separate modules and the earliest year of collection (baseline) varies for each module (Figure 1); for example, pupil characteristics are available from 2001 and hospital admissions data are available from 1997. When the ECHILD Database was created in November 2020, all HES and NPD data that were available were included. The most recent time period during which data were available varied by data module due to differences in data collection periods and lags between data being collected and made available for research (Figure 1); for example, information related to social care was available up to 31 March 2019, whereas information related to hospital admissions was available up to 31 March 2020.
Currently, the ECHILD Database includes information for young people up to the age of 25 years; however, it will be further updated to include more recent HES and NPD data as they become available, thereby enabling follow-up of health records into adulthood.
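The inclusion criteria and follow-up rule described above can be illustrated with a minimal Python sketch. It is not the project's own code: the function names and inputs are hypothetical, and because the ECHILD Database itself does not hold dates of birth, the full birth dates used here serve only to make the logic concrete. The handling of 29 February birthdays is likewise an assumption.

```python
from datetime import date

# Birth window for cohort membership, as stated in the profile.
BIRTH_START = date(1995, 9, 1)
BIRTH_END = date(2020, 8, 31)
MAX_AGE_YEARS = 25  # information is included up to the 25th birthday


def is_eligible(birth_date: date, has_hes_record: bool, has_npd_record: bool) -> bool:
    """Criterion (i): born in the window; criterion (ii): any HES or NPD record."""
    in_window = BIRTH_START <= birth_date <= BIRTH_END
    return in_window and (has_hes_record or has_npd_record)


def end_of_follow_up(birth_date: date, data_available_to: date) -> date:
    """Earlier of the 25th birthday and the last date covered by a data module."""
    try:
        birthday_25 = birth_date.replace(year=birth_date.year + MAX_AGE_YEARS)
    except ValueError:
        # 29 February birthdays in a non-leap target year:
        # hypothetical convention, use 1 March instead.
        birthday_25 = date(birth_date.year + MAX_AGE_YEARS, 3, 1)
    return min(birthday_25, data_available_to)
```

Because the latest available date differs by module (for example, 31 March 2019 for social care and 31 March 2020 for hospital admissions), `end_of_follow_up` would be evaluated separately for each module in an analysis.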

Data collected
The HES data included in the ECHILD Database contain information from hospital records for all NHS patients in England, including demographics and standardized codes for diagnoses, symptoms and procedures relating to the care they have received. 4 HES data are collected by NHS Digital from care providers and are curated on an ongoing basis in four modules related to different aspects of hospital care. The HES Admitted Patient Care (APC) module records hospital admissions and treatment that requires the use of a hospital bed, including births and day cases. 5 The HES Critical Care module records treatment for the subset of admitted adult patients where constant support and monitoring in adult designated wards is required to maintain at least one organ (i.e. an intensive care or high dependency unit). The HES Accident and Emergency (A&E) module records attendances at A&E departments, including some walk-in centres and minor injury units. The HES Outpatients module records all outpatient appointments at English NHS hospitals and the independent sector (if commissioned by the NHS), regardless of whether the appointment was attended or not. Since January 1998, HES data have been routinely linked to Office for National Statistics (ONS) mortality data. 6 This mortality information is also included in the ECHILD Database. Table 1 illustrates the range of health-related HES variables that are included in the ECHILD Database.
Information on diagnoses, treatments and procedures for each episode of care is recorded by clinical coders based on patient care records and/or discharge summaries using standardized codes. In the Admitted Patient Care, Critical Care and Outpatients modules, diagnoses are recorded using the International Classification of Diseases (ICD) version 10, and treatments and procedures are recorded using the Office of Population Censuses and Surveys (OPCS) version 4. In the Accident and Emergency module, bespoke codes are used to record diagnoses and treatments 14; however, these are much more limited than ICD-10 and OPCS-4 codes.
The ECHILD Database also contains information related to education and children's social care from the NPD. The NPD is made up of several modules that are collected by the DfE from schools, local authorities and examination awarding organizations on an ongoing and statutory basis. 7 The NPD includes four education census modules that collect information about the characteristics of pupils in different educational settings (Table 2). These pupil characteristics include age, gender, ethnicity, special educational needs (SEN), and free school meals (FSM) eligibility. Pupils are eligible for FSM if their parents are in receipt of certain means-tested benefits, and eligibility is often used in research as a proxy for income disadvantage. 8 The education census modules do not collect information for the estimated 7% of children in England enrolled in a private school, 9 0.1% in a hospital school (authors' calculation from DfE statistics) 10 or 0.7% who are home educated (authors' calculation from Office of the Schools Adjudicator statistics). 11 These children will have incomplete educational information in the ECHILD Database; however, the vast majority of children in England (at least 92%) will have some educational information recorded in the ECHILD Database.
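The "at least 92%" figure follows from the three estimated shares quoted above. The short Python check below reproduces that arithmetic; it assumes the three groups do not overlap (an assumption made here for illustration, not a statement from the source), and the variable names are hypothetical.

```python
# Estimated percentage of children in England not captured by the
# education census modules, as quoted in the text.
not_covered = {
    "private school": 7.0,
    "hospital school": 0.1,
    "home educated": 0.7,
}

# Assumes the groups are mutually exclusive, so the shares can be summed.
total_not_covered = sum(not_covered.values())
covered_lower_bound = 100.0 - total_not_covered

print(f"Not covered: {total_not_covered:.1f}%")
print(f"Covered (approximate lower bound): {covered_lower_bound:.1f}%")
```

If any child fell into more than one group, the uncovered share would be smaller than the sum, so the covered share would be slightly higher than this estimate.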
The NPD also includes information on educational outcomes with modules related to absences, exclusions (whereby a child is temporarily suspended or permanently expelled from school), attainment in national assessments and examinations, and participation in post-16 education. In England, participation in education is compulsory up to age 18; however, after the age of 16 this can include a combination of education, apprenticeships, training and part-time work. The NPD has two social care modules which are both included in the ECHILD Database. The Children in Need (CIN) census collects information related to children referred to social care services and those identified as needing additional help from those services to support their health or development. 12 The CIN census includes child characteristics (e.g. age, ethnicity, gender), as well as details of social care referrals, reviews, assessments and use of child protection plans (for children assessed to be at risk of serious harm). The Children Looked After Return (CLA) includes information related to children in care, who are referred to as looked-after children in the UK. 13 The CLA contains information related to child characteristics, placements in out-of-home care and adoptions.

Production of the ECHILD Database
HES and NPD include a pseudonymized identifier that allows records relating to the same individual to be linked across modules and over time (HESID 15 and anonymized Pupil Matching Reference (aPMR), 16 respectively). As there is no common pseudonymized identifier in HES and NPD, the ECHILD Database was created by NHS Digital by linking records based on identifiable information (specifically, name, date of birth, sex and chronology of postcode). According to N Libuy, PhD (written communication, February 2021), initial assessments of linkage quality in the ECHILD Database indicate very high linkage rates between HES and NPD records, with linkage rates improving from 94% for children born in 1996/97 to 98% linkage for children born in 2004/05. Full details of the linkage process to create the ECHILD Database have previously been published. 17 Briefly, the DfE extracted identifiers and associated aPMRs from NPD to separate them from attribute information about individuals' education and social care records. This information was securely transferred to NHS Digital. Similarly, at NHS Digital, identifiers and associated HESIDs were extracted from HES, separating them from attribute information about individuals' health and use of services. Deterministic linking algorithms were used by NHS Digital to link HESIDs and aPMRs and create a pseudonymized bridging file with indicators of link quality but no identifiable information. NPD and HES attribute data and HESID-aPMR bridging files were then separately transferred to a trusted research environment, the ONS Secure Research Service (ONS SRS), 18 and collated to create the ECHILD Database. The ECHILD Database is de-identified: it does not include any information that could be used to directly identify a person, such as names, addresses, postcodes or dates of birth. 
The pseudonymized identifiers it contains (HESID and aPMR) cannot be linked to real-world identifiers (such as NHS or National Insurance numbers) by researchers, including the ECHILD project team.
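The separation of identifiers from attribute data and the creation of a pseudonymized bridging file can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm used by NHS Digital (which is described in the cited linkage publication): the two match steps, the record layout and the function names are all hypothetical, and a single postcode stands in for the "chronology of postcode" used in practice.

```python
from typing import NamedTuple


class IdentifierRecord(NamedTuple):
    """Identifiers extracted from one source, kept apart from attribute data."""
    pseudo_id: str      # HESID (from HES) or aPMR (from NPD)
    name: str
    date_of_birth: str  # ISO format, e.g. "2001-02-03"
    sex: str
    postcode: str


def match_key(rec: IdentifierRecord, use_postcode: bool) -> tuple:
    """Normalize identifiers into a key for exact (deterministic) matching."""
    key = (rec.name.strip().lower(), rec.date_of_birth, rec.sex)
    if use_postcode:
        key = key + (rec.postcode.replace(" ", "").upper(),)
    return key


def build_bridging_file(hes_ids, npd_ids):
    """Link HESIDs to aPMRs in two hypothetical deterministic steps.

    Step 1 requires agreement on name, date of birth, sex and postcode;
    step 2 relaxes the postcode requirement for records still unlinked.
    The output holds only pseudonymized identifiers and a quality indicator.
    """
    bridge = []
    linked_hes, linked_npd = set(), set()
    for step, use_postcode in ((1, True), (2, False)):
        # Index the NPD records that are still unlinked by their match key.
        index = {}
        for rec in npd_ids:
            if rec.pseudo_id not in linked_npd:
                index.setdefault(match_key(rec, use_postcode), []).append(rec)
        for rec in hes_ids:
            if rec.pseudo_id in linked_hes:
                continue
            candidates = index.get(match_key(rec, use_postcode), [])
            # Accept only unambiguous matches to a record not already linked.
            if len(candidates) == 1 and candidates[0].pseudo_id not in linked_npd:
                bridge.append({
                    "hesid": rec.pseudo_id,
                    "apmr": candidates[0].pseudo_id,
                    "match_step": step,
                })
                linked_hes.add(rec.pseudo_id)
                linked_npd.add(candidates[0].pseudo_id)
    return bridge


# Entirely made-up example records.
hes = [
    IdentifierRecord("H1", "Ada Smith", "2001-02-03", "F", "AB1 2CD"),
    IdentifierRecord("H2", "Ben Jones", "2002-05-06", "M", "ZZ9 9ZZ"),
    IdentifierRecord("H3", "Cy Lee", "2003-01-01", "M", "AA1 1AA"),
]
npd = [
    IdentifierRecord("P1", "ada smith", "2001-02-03", "F", "ab12cd"),
    IdentifierRecord("P2", "Ben Jones", "2002-05-06", "M", "XX1 1XX"),
]
bridge = build_bridging_file(hes, npd)
link_rate = len(bridge) / len(hes)
```

The `match_step` field plays the role of a link-quality indicator: researchers receiving the bridging file see which step produced each link but never the names, dates of birth or postcodes that were compared.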

Data resource use
The ECHILD project is currently using the ECHILD Database for large-scale, longitudinal research that explores inter-relationships across the domains of health, education and social care in childhood and adolescence. For example, we are exploring the relationship between gestational age at birth, chronic health conditions, school attainment and special educational needs (SEN) in later childhood. Previous research using linked administrative data in Scotland has indicated a dose-response relationship between gestational age at birth and risk of SEN. 19 The creation of the ECHILD Database means it is now possible to explore whether there is a similar relationship for children in England and how this association is related to chronic health conditions. These results will be useful for policy makers and service providers; for example, for estimating future need for SEN support in schools based on birth characteristics of the population. The ECHILD project is also using the ECHILD Database to explore the impact of disruptions to health and education services during the COVID-19 pandemic on hospital attendances for children and young people in England. In particular, it focuses on whether children with additional needs (e.g. with a chronic health condition, in care or receiving SEN support) were more affected by service disruptions than their peers. A full list of publications from the ECHILD project is available at: [https://www.ucl.ac.uk/child-health/echild].

Strengths and weaknesses
Health and education are strongly interconnected for children and young people. For example, children with chronic health conditions have higher rates of school absence and poorer school performance than their peers. 20 Acknowledging these inter-relationships, policy makers have called for greater collaboration between service providers in these domains. 21 HES and NPD are well-established administrative datasets for health and education in England. These datasets act as an evidence base to inform policy: they are used to produce national statistics by government departments and for wider research purposes by the academic community. However, the lack of a common identifier in these administrative datasets has limited the potential for wide-scale analysis across domains. A strength of the ECHILD Database is that by linking data across these domains, it presents a unique and valuable opportunity to explore how children's health affects their education, and how their education affects their health. A further strength of the ECHILD Database is that it brings together de-identified information from long-standing, national administrative datasets. These constituent datasets include longitudinal, individual-level information for a large, whole-population-based cohort of children and young people. The large sample size (14.7 million individuals) and long follow-up period (up to 25 years) in the ECHILD Database will enable research into long-term outcomes and rare exposures. The constituent datasets are also well documented by data owners 1,22 and the research community. 4,5,12,13,23,24 This means that details about how information in the datasets is collected, what variables they contain and how coding has changed over time, for example, are readily available to researchers. The key strength of the ECHILD Database is that the use of the data is safeguarded by being de-identified and accessible only via a trusted research environment, the ONS SRS.
In the ONS SRS, there is strict monitoring of data access and use and scrutiny of outputs, including any tables, graphs or figures. Researchers using the ONS SRS have to undergo training in governance procedures and sign data access agreements that prohibit any attempt to re-identify individuals. Such attempts could lead to data access for their institution being revoked.
The main limitation of the ECHILD Database is that administrative datasets are not collected for research purposes. This may have implications for the type of research that can be carried out and how findings are interpreted. 25 For example, HES data are primarily used for payment purposes (i.e. care providers are reimbursed from NHS England through the 'Payment by Results' system), and so there may be differences in the quality and completeness of the information that is recorded based on the impact it has on payment. A further limitation is that the range of information collected in HES and NPD varies over time, and earlier years of data and linkage are considered to be of lower quality. However, both datasets are subject to data quality assurance checks at the point of submission to DfE or NHS Digital and are considered of high enough quality to produce national statistics. There are also issues related to missing data in non-mandatory HES variables; for example, in the HES Outpatients module, primary diagnosis is not recorded for 95% of records. 4 These limitations are well documented, but it is important that researchers using the ECHILD Database familiarize themselves with the constituent datasets to understand the potential limitations and caveats of their proposed analyses.

Data resource access
The ECHILD Database will be available to accredited researchers in 2021 by applying to the data providers (DfE and NHS Digital). Further documentation about the ECHILD Database, including an introductory user guide 17

Author contributions
LM-L wrote the manuscript, with critical input from all authors. All authors approved the final manuscript.