Cohort Profile: The Green and Blue Spaces (GBS) and mental health in Wales e-cohort

Daniel A Thompson, Rebecca S Geary, Francis M Rowney, Richard Fry, Alan Watkins, Benedict W Wheeler, Amy Mizen, Ashley Akbari, Ronan A Lyons, Gareth Stratton, James White and Sarah E Rodgers*

Population Data Science, Swansea University Medical School, Faculty of Medicine, Health and Life Science, Swansea University, Swansea, UK; Department of Public Health, Policy and Systems, University of Liverpool, Liverpool, UK; European Centre for Environment and Human Health, University of Exeter Medical School, Knowledge Spa, Royal Cornwall Hospital, Cornwall, UK; Department of Sport and Exercise Sciences, Applied Sports Technology, Exercise and Medicine A-STEM Research Centre, School of Engineering and Applied Sciences, Faculty of Science and Engineering, Swansea University, Swansea, UK; and Centre for Trials Research, School of Medicine, Cardiff University, Cardiff, UK

The cohort is constructed using data from the Welsh Demographic Service Dataset (WDSD). This dataset contains demographic characteristics of everyone registered with a general practitioner (GP) in Wales, providing data to the SAIL Databank (80% population coverage 15). It is used as the primary population register in the SAIL Databank. The WDSD contains names and addresses, with from-to dates of residency in each home; these are updated when patients inform their GP that they have moved home. Researchers accessed an anonymised version of the WDSD and calculated residency dates in each home, as well as house moves. All members of the household are included in the cohort, with individuals nested within each household.
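The step of deriving residency periods and house moves from from-to address dates can be illustrated with a minimal sketch. The record layout, the identifiers and the function names below are assumptions made for illustration; they are not the WDSD schema or the authors' code.

```python
from datetime import date

# Hypothetical anonymised address episodes: (person_id, home_id, from_date, to_date).
# The field names and values are illustrative only.
EXAMPLE_RECORDS = [
    ("p1", "h1", date(2008, 1, 1), date(2012, 6, 30)),
    ("p1", "h2", date(2012, 7, 1), date(2019, 9, 30)),
    ("p2", "h3", date(2010, 4, 1), date(2019, 9, 30)),
]


def days_resident(start, end):
    """Length of one residency episode in days, counting both end dates."""
    return (end - start).days + 1


def count_moves(records):
    """Count house moves per person as changes of home between consecutive episodes."""
    by_person = {}
    for person, home, start, _end in records:
        by_person.setdefault(person, []).append((start, home))
    moves = {}
    for person, episodes in by_person.items():
        episodes.sort()  # order episodes by start date
        homes = [home for _start, home in episodes]
        moves[person] = sum(1 for a, b in zip(homes, homes[1:]) if a != b)
    return moves
```

In this toy data, person p1 has two consecutive episodes in different homes and so counts as having moved once, whereas p2 never moved.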
The demographic dataset was used as the population spine, with additional data linked as follows:
• Welsh Longitudinal General Practice (WLGP): information on symptoms, diagnoses, prescriptions, and referrals 1 ;
• Annual District Death Extract from the Office for National Statistics (ONS) mortality register 2 ;
• Welsh Index of Multiple Deprivation (WIMD), the Welsh Government's official measure of relative deprivation for small areas in Wales 3 ;
• Rural-urban ONS classifications at Lower Layer Super Output Area (LSOA) 4 ;
• National Survey for Wales (NSW), an annual, repeated, cross-sectional survey of about 12 000 adults in Wales (2016-17 16 and 2018-19 17 surveys) including responses on wellbeing and visits to outdoor spaces.
The cohort comprises 2 801 483 individuals: all persons aged 16 and over registered with a practice providing GP records to the SAIL Databank. We intentionally removed people who did not meet the cohort criteria (Figure 1). We excluded 839 063 individuals who had missing data, e.g. they were not registered with a GP providing data to the SAIL Databank, did not have a Welsh residential address between January 2008 and October 2019 or did not have sex or week of birth recorded in WDSD.
We created measures of GBS exposure and access for all homes in Wales, using several environmental datasets: (i) satellite data (Landsat TM 18-21 2008-19) to create annual greenness densities of the mean Enhanced Vegetation Index (EVI) and Normalised Difference Vegetation Index (NDVI) within 300 m of each residence; (ii) Ordnance Survey MasterMap Topography Layer 22 (2018) to capture natural and man-made features, including the outline of homes and parks; (iii) Ordnance Survey MasterMap-derived Greenspace dataset (2018) 23 ; (iv) local authority (LA) technical advice notes, legally required records of data on sport, recreation and open spaces managed by local authorities (LAs); (v) open source portal data from Lle (forestry, urban tree cover) 22 ; and (vi) OpenStreetMap road/footpath data. 24 Environmental data were linked to the cohort at the individual level, using a residential version of the split file linkage process. 25,26 A final GBS typology (Supplementary Table S1, available as Supplementary data at IJE online) was used to create GBS access metrics for each home in Wales.
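For readers unfamiliar with the two greenness indices, the sketch below uses the standard published definitions of NDVI and EVI computed from surface reflectance bands. The function names and the example reflectance values are illustrative assumptions; the text does not describe the authors' processing pipeline at this level of detail.

```python
def ndvi(nir, red):
    """Normalised Difference Vegetation Index from near-infrared and red reflectance."""
    return (nir - red) / (nir + red)


def evi(nir, red, blue):
    """Enhanced Vegetation Index with the standard coefficients (G=2.5, C1=6, C2=7.5, L=1)."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


def mean_index(pixel_values):
    """Mean of an index over the pixels falling inside a buffer (e.g. 300 m around a home)."""
    return sum(pixel_values) / len(pixel_values)


# Hypothetical reflectances for three pixels inside one buffer: (nir, red, blue).
pixels = [(0.50, 0.10, 0.05), (0.40, 0.12, 0.06), (0.30, 0.15, 0.07)]
buffer_mean_ndvi = mean_index([ndvi(n, r) for n, r, _b in pixels])
buffer_mean_evi = mean_index([evi(n, r, b) for n, r, b in pixels])
```

Both indices increase with vegetation density; EVI additionally uses the blue band to reduce atmospheric and soil background effects.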
A cohort subgroup responded to Natural Resources Wales (NRW) questions in the 2016-17 and 2018-19 National Survey for Wales (NSW). 16,17 The NSW is an annual repeat, cross-sectional, government-sponsored, omnibus survey of a representative sample of the population of Wales (annual n of about 12 000). Topics include education, culture, health and wellbeing and more detailed information on socioeconomic circumstances than administrative data. The NRW questions (sub-sample, n = 5312) 27,28 record whether respondents visited outdoor spaces in Wales, including time spent outdoors on leisure activities, and types of activities undertaken. NSW respondents aged 16 years,

Key Features
• The Green and Blue Spaces (GBS) e-cohort includes 2.8 million UK adults and was established to quantify the impact of natural environments on mental health and wellbeing in Wales, UK.
• This is the first e-cohort with national household-level longitudinal environment metrics (annual) for 1.4 million residences linked to longitudinal electronic health records (updated quarterly), with a subgroup of 5312 linked survey responses on visits to outdoor spaces and wellbeing.
• Baseline and follow-up information was extracted quarterly through electronic record linkage, including mental health service use and sociodemographic and economic characteristics.
• After almost 12 years' follow-up, 0.7% were lost to follow-up due to migration out of Wales and were replaced with in-migration and those reaching the age of 16 years (25%), 9.9% died and 28% had at least one common mental health episode recorded with their general practitioner (GP).
We derived environmental metrics for all potential residences in Wales (n = 1 498 120). Of these, 1 179 817 (78%) residences were linked to the cohort through the WDSD. There were 318 303 unlinked potential homes (likely holiday homes, caravans, guest-houses), either because they did not match an address of an individual registered with a GP in Wales or were inhabited by people not registered at a GP practice. Area-level characteristics of residences linked and unlinked to the cohort were compared to check for potential bias (see 'What has it found?'). Of the 2 801 483 individuals in the cohort, 622 025 (22.2%) moved home once between 2008 and 2019, and 567 877 (20.3%) moved home more than once. Exposures and outcomes are extracted/updated quarterly.
How often have they been followed up?
Health-related outcomes were extracted quarterly. Environmental metrics were calculated annually but updated quarterly if cohort members moved home (see 'What has been measured'). The dynamic cohort design allows new people to enter the cohort each quarter as they reach age 16 years or move into Wales. Cohort sample size in each quarter is provided in Supplementary Table S2 (available as Supplementary data at IJE online). The current linkage of environmental and administrative data sources ended in September 2019, creating an 11-year cohort with annual follow-up for all, and quarterly follow-up for people moving home. Non-environmental datasets are routinely updated in SAIL, enabling health outcomes for the cohort to be followed up for longer. A total of 5 791 cohort members completed NRW questions in the 2016-17 and 2018-19 NSW. Further waves of the NSW have been consented for data linkage in SAIL.
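The combination of annual environmental metrics with quarterly updates for movers can be pictured as a lookup keyed on the home occupied in each quarter. The sketch below is a simplified illustration: it assumes, hypothetically, that the home occupied on the first day of the quarter determines the exposure, and the data structures and names are invented for the example.

```python
from datetime import date


def quarter_start(year, quarter):
    """First calendar day of a quarter (quarter is 1 to 4)."""
    return date(year, 3 * (quarter - 1) + 1, 1)


def exposure_for_quarter(episodes, annual_evi, year, quarter):
    """Return the annual EVI of the home occupied at the start of the quarter.

    episodes: list of (home_id, from_date, to_date) for one person.
    annual_evi: dict mapping (home_id, year) to that home's annual mean EVI.
    Returns None when no residence is linked for the quarter.
    """
    day = quarter_start(year, quarter)
    for home, start, end in episodes:
        if start <= day <= end:
            return annual_evi.get((home, year))
    return None


# Hypothetical person who moved home on 1 July 2012.
episodes = [
    ("h1", date(2008, 1, 1), date(2012, 6, 30)),
    ("h2", date(2012, 7, 1), date(2019, 9, 30)),
]
annual_evi = {("h1", 2012): 0.25, ("h2", 2012): 0.31}
```

Under these assumptions, the person's exposure changes from the value for home h1 to the value for home h2 in the quarter after the move, while the annual metric for each home stays fixed within the year.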
The GBS e-cohort was created from multiple data sources with varying levels of completeness across different variables. Known exclusions, due to missing data on age or sex (0.4%) or at least one primary environmental measure (EVI, <0.01%), resulted in a cohort of 2 801 483 people (Figure 1). This cohort has 24.9 million person-years of follow-up. An additional average of 30 238 people joined the cohort annually through migration into Wales or reaching age 16 years (34 709 people annually),

What has been measured?
Cohort variables are presented in themes: (i) sociodemographic and economic characteristics; (ii) common mental health disorders/wellbeing; (iii) comorbidity index; (iv) social environment and life events (births/deaths in the household); (v) environmental metrics; and (vi) other administrative cohort information (Table 1).
Anonymised Linking Field (ALF) and Residential Anonymised Linking Field (RALF) are individual and household anonymised linking fields, respectively, within the Secure Anonymised Information Linkage (SAIL) Databank. 31,32

Key health metrics are (quarterly): Common Mental Health Disorder (anxiety and depressive disorders) and a count of all GP events (extracted from WLGP). The WLGP is collated from clinical information systems in use at each general practice around Wales, and uses Read codes recorded during a GP consultation. Test results are electronically transferred into the WLGP from secondary care systems. To identify people with Common Mental Health Disorders (CMDs), we applied an existing validated prevalence algorithm with high sensitivity to detect cases of CMD (anxiety and depression). 33 We identified people with CMD each quarter when they had either a historical diagnosis(es) currently treated, and/or current diagnoses or symptoms (treated or untreated) from Read codes (detailed in Supplementary Table S3, available as Supplementary data at IJE online) in their GP record in the WLGP data (Algorithm 10). 33 The algorithm identifies 'current' diagnoses/symptoms as relevant Read codes in the preceding 1-year period. It identifies 'historical' diagnoses through a search for relevant Read codes through the cohort data outside the 'current' period. The length of retrospective data available varied between individuals in the cohort, depending on the length of their registration with a GP supplying data to SAIL. CMD treatment was identified as at least one prescription for an antidepressant, anxiolytic or hypnotic in the 1-year current period. 1 We did not include cognitive behavioural therapies or other non-drug treatments in our CMD case definition, as this information was not available in WLGP.
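The case definition described above can be expressed as a small decision rule. The sketch below is a simplified illustration only: it replaces the Read code lists of Supplementary Table S3 with three hypothetical event labels, takes the 1-year 'current' period to be 365 days ending at the quarter end, and is not the validated algorithm itself.

```python
from datetime import date, timedelta

# Assumption: the 1-year 'current' period is approximated as 365 days.
CURRENT_WINDOW = timedelta(days=365)


def has_cmd(events, quarter_end):
    """Flag a probable CMD case for one person in one quarter.

    events: list of (event_date, kind), where kind is one of the hypothetical
    labels "diagnosis", "symptom" or "treatment".
    """
    window_start = quarter_end - CURRENT_WINDOW
    current_kinds = [
        kind for day, kind in events if window_start < day <= quarter_end
    ]
    historical_diagnosis = any(
        kind == "diagnosis" and day <= window_start for day, kind in events
    )
    current_dx_or_symptom = any(
        kind in ("diagnosis", "symptom") for kind in current_kinds
    )
    currently_treated = "treatment" in current_kinds
    # Case if: current diagnosis or symptom (treated or untreated),
    # or a historical diagnosis that is currently treated.
    return current_dx_or_symptom or (historical_diagnosis and currently_treated)
```

A person with a diagnosis several years ago and a prescription in the current period is flagged, whereas a prescription with no diagnosis or symptom on record is not.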
The algorithm applied to identify probable cases of CMD has high specificity and positive predictive value for detecting CMD (anxiety and depression) but, as expected, has low sensitivity. 33 We identified adults (16+ years) with CMD in the GP dataset. We refer to people 'having a CMD', but we acknowledge that this only captures those who have sought care for their CMD in primary care. Community prevalence will be significantly higher, because only about one-third of people affected by CMD seek help in primary care. 4 GP-specific events were converted from daily counts to a binary variable and then aggregated to quarterly counts. This eliminated counting multiple test results. Each individual in the cohort also had quarterly measures for Charlson comorbidity index 30 and a count of hospital admissions.
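The conversion of GP events from daily counts to a binary daily indicator, followed by quarterly aggregation, can be sketched as follows. The function name and the input format (one date per recorded event) are assumptions for illustration.

```python
from collections import Counter
from datetime import date


def quarterly_gp_event_days(event_dates):
    """Count days with at least one GP event, per (year, quarter).

    Several events recorded on the same day (for example multiple test
    results) collapse to a single day before aggregation.
    """
    unique_days = set(event_dates)  # binary indicator: any event on that day
    return Counter((day.year, (day.month - 1) // 3 + 1) for day in unique_days)


# Hypothetical record: three results on one day, plus two other event days.
example_dates = [
    date(2015, 1, 5), date(2015, 1, 5), date(2015, 1, 5),
    date(2015, 2, 1),
    date(2015, 4, 1),
]
example_counts = quarterly_gp_event_days(example_dates)
```

Here the three events recorded on 5 January contribute one event day, so the first quarter of 2015 has a count of two rather than four.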

Environmental metrics
GBS exposure within 300 m of each home in Wales was measured yearly from open source satellite imagery. Three variables representing ambient green/blueness were linked to the cohort:
• mean EVI (minimum, mean, median, max);
• mean Normalised Difference Vegetation Index (NDVI) (minimum, mean, median, max);
• coastal and/or inland water (yes/no).
We used imagery with less than 20% cloud cover to estimate EVI/NDVI, resulting in 87.7% of homes with full coverage of EVI and NDVI values from 2008 to 2019. Where homes were missing an EVI/NDVI value for a given year, and neighbouring years were available, we imputed these values.
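The text does not state how missing annual values were imputed from neighbouring years. As one plausible illustration only, the sketch below assumes the missing year is filled with the mean of the immediately preceding and following years when both are present; this rule and all names are assumptions, not the authors' documented method.

```python
def impute_from_neighbours(series):
    """Fill gaps in an annual series using adjacent observed years.

    series: dict mapping year to a value, or to None when missing.
    Assumption: a gap is filled with the mean of the previous and next
    year, and only when both neighbours were observed.
    """
    filled = dict(series)
    for year, value in series.items():
        if value is None:
            previous_value = series.get(year - 1)
            next_value = series.get(year + 1)
            if previous_value is not None and next_value is not None:
                filled[year] = (previous_value + next_value) / 2
    return filled


# Hypothetical EVI series for one home; 2009 and 2011 are missing.
evi_by_year = {2008: 0.30, 2009: None, 2010: 0.34, 2011: None}
imputed = impute_from_neighbours(evi_by_year)
```

In this example 2009 is filled from 2008 and 2010, while 2011 stays missing because no value for 2012 is available.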
The potential for an individual to access a range of types (Supplementary Table S1) of GBS, along a network of paths and roads within 1600 m of each home, was modelled for 2012 and 2018. Ambient green/blueness, and potential to access GBS, were augmented by survey responses about leisure time visits to outdoor spaces in Wales for the NSW subgroup.
Household-individual data linkage methods created a longitudinal dataset with the potential for a granular temporal examination of the impact of changes in green and blue space on health inequity for individuals. This design is more appropriate than previous studies for inferring causal links. [1][2][3] Cohort members have their home location linked to appropriately synchronised environmental data, extracting subsequent health outcomes from their electronic health records. This provides the opportunity to construct natural experiments or pragmatic trials within the cohort. 5,6

What has it found?
Using a combination of open source environmental and national mapping agency data, we have demonstrated the feasibility of creating individual-level, longitudinal, environment exposure data with national coverage for 2.8 million adults in Wales. Longitudinal linkage of national-level environmental data, for 1.4 million homes with routinely collected electronic health records and socioeconomic data, allows this cohort to be used to assess the impact of a changing environment on subsequent common mental health disorders, wellbeing and other health outcomes. 26 At an individual level, there was little variation in data completeness between those identified as having a CMD at least once and those without a CMD: 99.9% (n = 816 020) and 99.4% (n = 1 983 590), respectively. At a household level, 92.3% (n = 2 598 211) of the cohort were linked to a home address for every quarter they were in the e-cohort. Individuals were censored during a quarter if no place of residence could be linked, or if their GP did not provide data to the databank. Individuals with at least one CMD episode had 90.4% (n = 739 054) residential data completeness compared with 93.1% (n = 1 859 157) of those without a CMD.
Full environmental data (EVI and NDVI) were linked for 85% of the cohort (n = 2 384 489) for their complete cohort duration. We examined the linkages to check for bias by deprivation and rurality. The percentage of unlinked homes did not increase with deprivation. However, we found that a higher proportion of unlinked homes were in rural areas. We did not find a systematic bias with EVI; mean EVI for unlinked and linked homes were similar (0.3, Table 2).
A total of 29% of the cohort (816 242) sought care for a CMD in general practice between January 2008 and October 2019. A total of 461 728 (16%) people in the cohort had a previously diagnosed CMD for which they sought care in general practice, subsequently entering the e-cohort ('historical diagnosis'). For the more than 300 000 people newly seeking treatment for a CMD from their GP (i.e. who had no 'historical diagnosis', n = 305 779), a larger proportion (14%, n = 43 350) were living in more affluent, greener areas (measured by mean EVI) by the end of their time in the cohort (relative to when they entered the cohort) compared with only 8% (n = 23 795) who were living in deprived areas with less greenery immediately surrounding the home. In contrast, most people (75%, n = 267 446) who had a 'historical' CMD diagnosis and who also had a CMD during the cohort period (2008-19, n = 358 126), lived in greener areas by the end of their time in the cohort.
People living in the most deprived areas had on average less ambient greenness around their home than those living in the least deprived areas (mean EVI 0.25 vs 0.31, respectively, Table 2). The dynamic cohort captures abrupt GBS changes resulting from home moves as well as in situ slower changes in ambient greenness. More than one-fifth (22.6%) of the adult population in the most deprived quintile moved home at least once during the cohort period, with fewer moving in the least deprived (18.7%) and next-least deprived (18.2%) quintiles (Table 3). Younger people (<30 years old) and those living in the most deprived areas had the highest prevalence of moving at least once during their time in the cohort (48.9% and 22.6%, respectively, Table 3).
We will apply advanced analytical approaches to the longitudinal health and exposure cohort, with the aim of quantifying the impact of GBS on individual-level mental health and wellbeing. 1 The use of routinely collected historical data and established linkage mechanisms allows this e-cohort to be extended, either to include those under 16 years and/or to evaluate the impact of natural environments on further health, social and public health outcomes. Published cohort papers are listed

What are the main strengths and weaknesses?
The cohort is subject to minimal attrition due to the inclusion of all GP-registered individuals, unless individuals have opted out by making a request to their GP (see https://saildatabank.com/faq/). This minimizes the potential for selection bias. The cohort currently contains 2 801 483 adults. This will change with further follow-up years because the dynamic e-cohort structure accommodates migration in and out of Wales, as well as deaths and ageing into the cohort (i.e. reaching age 16 years). This large adult population cohort provides sufficient power to examine variations between subgroups to investigate inequalities. We reduced ecological fallacy using privacy-protecting data linkage methods to construct household measures of GBS. 5,6 Longitudinal environmental metrics, and linkage methods, enable an objective assessment of environmental changes, with no research burden for individuals. [34][35][36] A strength of this cohort is the ability to disentangle health outcomes from 'greening gentrification' by anonymously 'tracking' individuals over time. 37 System-wide natural changes may be slowly evolving and so the impact on population health requires longer follow-up. Over a long duration, place-based improvements may displace an area's original population with those who are more affluent and healthier ('gentrification'). Results of place-based intervention studies investigating area-level health effects over long periods of time are therefore likely to record health outcomes of a different, healthier, population.
Like other electronic health records cohorts, the GBS e-cohort data are predominantly routinely recorded and lack data on behaviour, some potential confounding factors and outcomes such as wellbeing. There is no health-related quality of life instrument routinely used to assess changes in health status in general practice in Wales. The cohort is largely restricted to detecting changes in outcomes that involve health service use. However, through linkage to survey data, a subset of the cohort has information on wellbeing as well as on behaviours such as time spent visiting GBS (n = 5312 adults).
The validity and reliability of research using routinely collected data depend upon its quality and completeness. Overall, the validity of primary care diagnoses in the UK tends to be high. 38 Case-finding for CMD in routinely collected administrative health data can unobtrusively identify patients for mental health research, including on the effects of intervention. 39 Diagnostic coding can differ between clinicians/practices over time, which may influence the sensitivity and specificity of algorithms to identify patients using a specific case definition in e-cohorts over time. A validation study, comparing using Read codes and algorithms for CMD case-finding (including the algorithm we have used) with the five-item Mental Health Inventory, demonstrated that using diagnosis and current treatment alone to identify CMD using routinely collected GP data would miss a number of true cases, given changes in GP recording behaviour between 2000 and 2010. Including historical diagnoses with current treatment and symptoms, as in this cohort, increases sensitivity. We captured annual ambient exposure to greenness, and temporally matched these to subsequent health outcomes. This improves on previous studies that did not have the data or systems to achieve this. We were unable, however, to continue this with the access metrics because several key data sources were not updated frequently and do not currently capture change in land use consistently. This has created a temporal mismatch between (annual) greenness measures (EVI, NDVI) and access measures (2018), which means we could not allocate a precise period when access to a GBS (new or old) may have changed. We recommend that GBS data providers update data regularly using consistent standards to capture changes in access to, and quality of, GBS through time.
Can I get hold of the data? Where can I find out more?
This cohort is stored and maintained in the SAIL Databank at Swansea University, Swansea, UK. This is a controlled access cohort; all proposals to use SAIL data are subject to review by an independent Information Governance Review Panel. Where access is granted, it is gained through a privacy protecting safe haven and remote access system (SAIL Gateway). The cohort data will be available to external researchers for collaborative research projects after 2022. For further details about accessing the cohort, contact [saildatabank.com] and Sarah Rodgers [ARCNWC@liverpool.ac.uk] for opportunities to collaborate with the original investigator team.

Ethics approval
This cohort is based on routinely collected administrative, environment and survey data. All data will be anonymised into a secure databank, and therefore there will be no mechanism for informing potential cohort participants of possible benefits and known risks. The cohort received approval from an independent Information Governance Review Panel, an independent body consisting of membership from a range of government, regulatory and professional agencies. We obtained informed consent to use the linked and anonymised NSW data within the SAIL databank. All routinely collected anonymised data held in SAIL are exempt from consent due to the anonymised nature of the databank (under section 251, National Research Ethics Committee).

Data availability
See 'Can I get hold of the data?', above.

Supplementary data
Supplementary data are available at IJE online.

Author contributions
S.E.R. designed and led the development of the cohort. D.T. produced the analysis and cohort linkage and drafted the paper with R.G. R.F. and A.M. produced the exposure metrics and reviewed the paper. A.W. provided input on analytical strategy. F.R. and B.W. produced the analysis and linkage for individuals linked to NSW survey and reviewed the paper. R.L., G.S. and A.A. reviewed the paper. All authors contributed to cohort design through input to regular meetings. All authors reviewed the final submitted paper.

Funding
The GBS and Mental Health in Wales cohort was developed as part of independent research funded by the National Institute for Health Research (NIHR), project number 16/07/07, and the UK Prevention Research Partnership, GroundsWell (MR/V049704/1). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.