Cohort profile Cohort profile : Extended Cohort for E-health , Environment and DNA ( EXCEED )

Cohort profile: Extended Cohort for E-health, Environment and DNA (EXCEED) Catherine John,* Nicola F Reeve , Robert C Free, Alexander T Williams, Aliki-Eleni Farmaki, Jane Bethea, Linda M Barton, Nick Shrine, Chiara Batini, Richard Packer, Sarah Terry, Beverley Hargadon, Qingning Wang, Carl A Melbourne, Emma L Adams, Catherine E Bee, Kyla Harrington, José Miola, Nigel J Brunskill, Christopher E Brightling, Julian Barwell, Susan E Wallace, Ron Hsu, David J Shepherd, Edward J Hollox, Louise V Wain and Martin D Tobin Department of Health Sciences, NIHR Leicester Biomedical Research Centre, Department of Respiratory Sciences, University of Leicester, Leicester, UK, Department of Population Science and Experimental Medicine, Institute of Cardiovascular Science, University College London, London, UK, Department of Haematology, University Hospitals of Leicester NHS Trust, Leicester, UK, Leicester Law School, Department of Cardiovascular Sciences and Department of Genetics and Genome Biology, University of Leicester, Leicester, UK

Why was the cohort set up? EXCEED aims to develop understanding of the genetic, environmental and lifestyle-related causes of health and disease. Cohorts like EXCEED, with broad consent to study multiple phenotypes related to onset and progression of disease and drug response, have a role to play in medicines development, by providing genetic evidence that can identify, support or refute putative drug efficacy or identify possible adverse effects. 1 Furthermore, such cohorts are well suited to the study of multimorbidity.
Multimorbidity describes the presence of multiple diseases or conditions in one patient, though definitions in the literature vary widely. [2][3][4] It demands a holistic approach to optimize care and avoid iatrogenic complications, such as drug interactions. In the context of increasing specialisation of many health care systems and high health care use among people with multimorbidity, providing such care poses a complex challenge. [5][6][7] In high-income countries, multimorbidity is particularly common among more deprived socioeconomic groups and may even be considered as the norm amongst older people, 8,9 and an ageing global population and a growing burden of non-communicable diseases in low-and middle-income countries compound its global importance. 10 An expert working group convened by the UK Academy of Medical Sciences recently highlighted the lack of available evidence relating to the burden, determinants, prevention and treatment of multimorbidity, and recommended the prioritisation of research on multimorbidity spanning the translational pathway from understanding of its biological mechanisms to health services research. 11 Studies designed to investigate multimorbidity, rather than considering individual conditions in relative isolation, are therefore vital. 6,7 Linkage to electronic health records (EHR) has enabled information on a broad range of diseases and risk factors to be studied in EXCEED and places multimorbidity at the study's heart. The EHR linkage also facilitates longitudinal follow-up over an extended period, enabling, for example, the investigation of lifestyle factors and other exposures on healthy ageing and outcomes in later life.
Combining wide-ranging data from EHR with genomewide genotyping is also central to EXCEED's purpose. In recent years, our understanding of which genes are associated with both rare and common diseases has advanced rapidly as available sample sizes for genome-wide association studies (GWAS) have grown rapidly. 12 For example, there are now 279 genetic variants associated with lung function and chronic obstructive pulmonary disease (COPD). [13][14][15][16][17][18][19][20][21][22][23][24][25] However in many cases, our understanding of the mechanisms through which these variants influence disease risk-and which could therefore be therapeutic targets-is relatively limited. An efficient design to inform this understanding is to stratify participants based on available study data on their health status (phenotype) or genetic risk factors (genotype), to thus recall them for further detailed investigations which would be impracticable across a whole cohort. EXCEED was purposely designed as a resource for recall-by-genotype sub-studies, and all participants have consented to be recalled on this basis.
The study is led by the University of Leicester, in partnership with University Hospitals of Leicester NHS Trust and in collaboration with Leicestershire Partnership NHS Trust, local general practices and smoking cessation services.

Who is in the cohort?
Recruitment to the cohort has taken place since 22 November 2013, primarily from the general population through general practices in Leicester City, Leicestershire and Rutland. In total, 10 156 participants have been recruited to 4 December 2018. Of these, 445 were recruited through smoking cessation services in Leicester City, Leicestershire and Rutland, 44 through targeted recruitment of those with a recorded diagnosis of COPD in their electronic primary care record and 117 through additional community-based recruitment focused on Leicester's South Asian communities (see Figure 1). Although a recruitment target of 10 000 has now been reached, community-based recruitment particularly focusing on minority ethnic groups will continue subject to further funding.
All tables in this paper present participants recruited via primary care or smoking cessation, whose data were collected and quality control undertaken at 4 December 2018 (9384 participants for questionnaire data, 8930 participants for primary care data). Quality control for the remaining questionnaires and linkage to primary care data are ongoing. Around 400 participants do not have questionnaire data but were recruited as consent and a saliva sample were provided.
In the UK, over 98% of the population is registered with a National Health Service (NHS) general practitioner. 26 For recruitment through primary care, all registered patients aged between 40 and 69 years in participating general practices were eligible for recruitment. Exclusion criteria were minimal: those receiving palliative care, those with learning disabilities or dementia and those whose records indicated they had declined consent for record sharing for research. All eligible patients identified through primary care were sent an initial letter with brief information about the study and a reply slip to indicate their interest.
For participants recruited via smoking cessation services, the lower age limit was reduced to 30 years because of the higher risk of disease among smokers. Initial eligibility screening and information provision were either undertaken through electronic client records followed by a letter to the client (as in primary care) or face-to-face by a smoking cessation adviser during a routine appointment. Additionally, patients with a recorded diagnosis of COPD were invited from four local general practices with a higher prevalence of COPD, to boost the numbers available for a sub-study of respiratory disease. For this group, the lower age limit was 30 years, and all other exclusion criteria were identical to the main primary care recruitment.
All those who responded to indicate they were interested in taking part were sent full written information on the study, in addition to a study consent form. Full information regarding participant consent can be found at [http://www.leicsrespiratorybru.nihr.ac.uk/our-research/ our-research-studies/exceed]. All participants consent to follow-up of their electronic health care records for up to 25 years, to storage and analysis of their DNA sample and to being contacted for further studies on the basis of their genetic data (recall-by-genotype) or health status (recallby-phenotype). They may also consent or decline to be contacted regarding genetic variants which may, in the future, be considered clinically relevant.
Participants proceeded via one of two routes depending on their location and personal preference: a face-to-face appointment with a research professional, or by post. The flow of participants through the main primary care recruitment route is illustrated in Figure 2. Approximately 8% of those who received an initial invitation via primary care completed recruitment. Table 1 gives an overview of the demographic characteristics of the primary care population sampled, compared with the characteristics of those recruited to the study via primary care. Participants in the   Primary care population is all patients within the eligible age range in the practices sampled, and includes those who were excluded at the next step (codes for palliative care, dementia, learning disability, or lack of consent to share data for research). study were older and more likely to be female than the primary care population from which they were drawn. This reflects well-known patterns of participation in similar cohorts. 27,28 The local primary care population includes a large proportion of minority ethnic groups, especially Asian and Asian British. These groups are under-represented among study participants, although the proportion of study participants of Asian and Asian British ethnicity (5%) is higher than many UK cohorts, including UK Biobank. 28 This reflects experience of similar recruitment methods in other studies. 29 Explanations for the under-representation of minority ethnic groups in medical research more generally include language barriers, inequitable access to health care services, cultural sensitivities and a lack of awareness of medical research and its purpose. 30,31 Community-based recruitment has been introduced to EXCEED to improve representation of these groups.

How often have they been followed up?
Participants have consented to follow-up through linkage to EHR for up to 25 years. Linkage to electronic primary care records (i.e. records from the participant's general practice) is undertaken upon completion of recruitment at each practice and has been completed for 8930 participants to 4 December 2018.
As participants are prospectively followed up, we expect losses due to deaths (to date less than 1% of participants), withdrawals (to date less than 0.1% of participants), relatively few losses due to house moves within the UK or changing general practitioner (as NHS patients retain the same NHS number and their electronic records move with them) and some losses due to emigration. Analyses of historical health care records to track disease development and progression may be subject to selection bias, in particular survivor bias.

What has been measured?
There are several phases of data collection, summarised in Table 2. Linked primary care data provide historical cohort data. Since the mid-1990s, prospectively recorded consultations enable the retrieval of information not only on symptoms for which participants have visited their general practitioner and diagnoses which have been made, but also on examination findings (including blood pressure readings and spirometry results, for example), laboratory test results, drug prescriptions and secondary care referrals. Major diagnoses documented on paper records before the mid-1990s were retrospectively coded at the time of computerization and so can also be retrieved.
Baseline data collection for all participants included a self-completion questionnaire which collected detailed information on current and past smoking habits, smoking cessation attempts, e-cigarette and shisha usage, environmental tobacco smoke (second-hand smoke) exposure and alcohol use. For those recruited via a face-to-face appointment, this was undertaken during the appointment. Those participating by post completed the questionnaire online using their own computer, with a paper version available if necessary. Height, weight and waist circumference were either measured by a research professional or self-reported by postal participants. Those recruited face-to-face also had their hip circumference measured and, where feasible, underwent spirometric measurement of lung function.
Finally, a saliva sample was collected from all participants either at their appointment or returned by post, for extraction of DNA. The samples are stored at the NIHR Biocentre (Milton Keynes, UK), providing industrial-scale laboratory information management and automated robotic systems which have been shown to facilitate efficient error-free sample storage and extraction from freezers in the UK Biobank study. 32 To date, genome-wide genotype data (using the Affymetrix UK Biobank Axiom Array) are available for 5216 participants after quality control, enabling analysis of over 40 million variants after imputation to the Haplotype Reference Consortium (HRC) panel. 33 Planned updates to linked primary care records will enable longitudinal tracking of health. There is also ongoing linkage to other sources of health data including: admissions, accident and emergency attendances and outpatient appointments via hospital episode statistics; pathology data (East Midlands Pathology Service); and the Myocardial Ischaemia National Audit (MINAP). Table 3 shows that, in general, our cohort is slightly healthier than average for common health risk factors and behaviours. This is similar to findings by other cohort studies. 28 For example, the total proportion of participants who were overweight, obese or morbidly obese (64.2%) was slightly lower than similar age groups in Health Survey for England 2016, where it was above 70% for all ages from 45 upwards. 34 Similarly, the proportion of EXCEED participants who currently smoke is 9.7%, considerably lower than the national average (15.8%) and comparable only to the oldest age group (65 and over) in the national Annual Population Survey, among whom smoking prevalence was 8.3%. Smoking prevalence among all younger age groups nationally is 15% or above. On the other hand, the proportion of people who report never smoking is also lower than in national population surveys. This may be influenced by question wording and interpretation: whereas the relevant national survey asked if people had ever 'regularly' smoked, the EXCEED questionnaire included occasional use in the definition of ever smokers. 35 Table 4 presents more detailed information on smoking habits among current and ex-smokers. The vast majority of both current and ex-smokers reported smoking cigarettes, but cigar/cigarillo and pipe smoking was less common amongst current than ex-smokers. Alcohol intake for our cohort is comparable to that of similar age groups in Health Survey for England 2016. 36 Only 25.2% of participants are in the two most deprived national quintiles and 29.3% are in the least deprived quintile. For Leicester City, 75.9% of the population are in the two most deprived quintiles and only 1.4% are in the least deprived quintile. 37 Though this reflects the whole Leicester population, not just those aged 40-69 and registered with the GP practices that agreed to take part in EXCEED, it indicates that the most deprived communities are under-represented in the cohort.

What has it found? Key findings and publications
The Quality and Outcomes Framework (QOF), introduced in 2004, aims to improve the quality of care patients  Table 6. We found that, overall, 27.2% of our participants had a recorded diagnostic code for more than one QOF condition. This is in line with findings from a large study of almost 100 000 individuals in the Clinical Practice Research Database by Salisbury and colleagues, who used a similar approach to define multimorbidity. 5 They found that 16% of their population had a code for more than one QOF condition, but this rose sharply with age, reaching around 20% among 55-to 64-year-olds and over 30% in 65-to 74-year-olds. Two further large UKbased studies have used more comprehensive lists of conditions to define multimorbidity, but limited their focus to active morbidity only, and found prevalence of multimorbidity between 23.2% and 27.2% across all ages, rising substantially with age to 50% or more among 65-to 74year-olds. 8,38 We specifically examined primary care diagnoses of one condition, COPD, for which we had independent People may use more than one type of tobacco, so percentages will not add up to 100. b Filtered, unfiltered and hand-rolled. c Only for quit attempts lasting at least 6 months. Denominator for percentages is current smokers who have made a quit attempt lasting at least 6 months, or total number of ex-smokers. People may have used more than one aid, so percentages will not add up to 100. d Only for cigarette smokers.   Table 5). b Of participants with primary care data. diagnostic information from baseline spirometry. Diagnosis of COPD defined by presence of a COPD code in primary care data compared with COPD defined by baseline spirometry results indicates that there is underdiagnosis of COPD in EXCEED participants: 84.8% of those with GOLD stage 1-4 COPD and 71.9% of GOLD stage 2-4 were undiagnosed (Table 7). Existing estimates of the proportion of COPD which is undiagnosed range from around 60% to over 80%, depending on the setting and population studied. [39][40][41][42][43] Using comparable methodology in a similar population to ours, Shahab and colleagues found that 81.2% of those with spirometric COPD had no respiratory diagnosis at all and over 95% had not been diagnosed with COPD. 44 The slightly lower level of underdiagnosis in EXCEED (84.8% of those with spirometric COPD had not received a COPD diagnosis) may be partially attributable to recent improvements in case-finding, and to our use of primary care records rather than self-report to define diagnoses. Reasons posited for such extensive underdiagnosis include a perception among clinicians that COPD is solely a disease of elderly smokers, pessimistic views of treatment, lack of availability or underuse of spirometry, 43,45 and the unreliability of self-reported smoking status in clinical practice. 44 To demonstrate the utility of EXCEED for enabling cross-sectional or longitudinal studies of quantitative traits, we examined some of the most common measures available in the primary care data and the numbers of participants with one or more recordings of these measures (Table 8). Table 9 shows the average values of these measures. 46 For example, 98.0% of participants have two or more recordings of blood pressure and over 90% have four or more recordings. Mean systolic blood pressure was 129.9 [standard deviation (sd) 13.9] and mean diastolic blood pressure was 78.3 (sd 8.6) ( Table 9).

Recall-by-phenotype study
Recalling by phenotype (see Figure 3) facilitates in-depth study of disease mechanisms, with a reduced risk of bias as with nested case-control studies. 47 One such sub-study has recalled EXCEED participants to take part in a study examining the microbiome in COPD cases and in smoking and non-smoking controls.

Potential for recall-by-genotype studies
Future recall-by-genotype studies (Figure 3) are expected to contribute to a deeper understanding of genetic variants which may be potential therapeutic targets, by bringing back participants for detailed assessments on the basis of the known or suspected mechanism of the relevant gene. Such recall-by-genotype sub-studies may investigate disease susceptibility, disease progression or drug response, and though they could be interventional in design, most will be observational studies. 48 Observational studies of this kind can provide evidence which is not susceptible to reverse causation and to confounding by lifestyle factors, given Mendelian randomization. 49 Nested designs are also feasible, which do not rely on recall of participants but which could be undertaken quickly and inexpensively using stored biological samples  and linked electronic data, and such sub-studies could select samples based on either phenotype or genotype. Smallscale intervention-by-genotype studies could, for example, evaluate response to a treatment with a known safety profile in participants with a specific genetic variant.
What are the main strengths and weaknesses?
Linkage to EHR is a key strength of EXCEED, enabling the study of a wide range of risk factors and diseases, even where data have not been specifically collected at baseline or precede enrolment as a study participant. UK general practice has had over 20 years of near-universal computerized records. 50 These records have been further enhanced with the introduction of the QOF in 2004, which incentivized GPs to keep comprehensive records of several chronic diseases. 51 Some of these indicators incentivize the recording of quantitative traits relevant to the chronic disease diagnosed, such as blood pressure, lung function, estimated glomerular filtration rate, glycated haemoglobin (HbA1c) and cholesterol measures. That these are expected to be recorded approximately annually means that registered patients often have many repeat measures within linked EHRs, providing an excellent opportunity to study trends in control of conditions such as hypertension or progression of diseases such as COPD. Previous studies have validated some of these primary care measures-for example, routinely recorded spirometry has shown good validity when compared with study specific measures. 52 Other more complex longitudinal outcomes-for example, related to healthy ageing-can also be measured using EHRs. The use of EHR can have limitations. Misclassification and miscoding of diagnoses may occur, and it is particularly likely that the true prevalence of many diseases will be underestimated (the 'clinical iceberg'), as demonstrated by a comparison of COPD diagnoses in primary care data in EXCEED with COPD from spirometry (Table 7). However, the availability of repeat recordings and multiple types of data (including examination findings, pathology results and onwards referrals) over a long period of time can be used to improve and validate the classification of diagnoses and other important exposures and outcomes. Many disease definitions have been validated already-for example, definitions of COPD and asthma in the GOLD-CPRD database-and EXCEED will contribute further to this important area of study. [53][54][55][56][57][58][59] In addition to disease status validation, combining records of drug prescriptions and diagnostic and symptom codes can be used to define complex phenotypes that it has not been possible to study previously.
The cohort recruited adults aged between 30 and 69 years, mostly aged 40 or over, and therefore permits the study of a wide range of questions pertaining to health and disease in adulthood. The absence of younger participants renders it less suitable to study the evolution of disease before age 40. This is mitigated by the availability of linked health care data from EHR. These records include data prospectively coded by primary care practitioners from the mid-1990s onwards, and will therefore include extensive data from early adulthood for those in middle age when recruited. Earlier life events, for example in childhood and adolescence, are likely to be captured only when they were transferred from paper to electronic health care records in the early 1990s, and will therefore include major events such as childhood pneumonia but not more minor illnesses. The very elderly are also currently absent from the cohort, but data will become available through follow-up of those recruited at the older end of the age spectrum. More generally, the cohort's age, sex and ethnicity distribution will influence generalizability of research findings to other population groups, and for some research questions, validation in other cohorts may be required.
Minority ethnic groups, notably Leicester's Asian and Asian British population, are currently under-represented in EXCEED. This reflects the recruitment methods used to date. We have extended recruitment to increase minority ethnic participant numbers and have adapted our recruitment methods to achieve this, for example by undertaking recruitment at community events. Minority ethnic groups are also substantially under-served in the availability of samples with genomewide genotype data worldwide. Although the situation has improved in recent years for Asian populations, only 14% of individuals included in genome-wide association studies worldwide up to 2016 were from Asian backgrounds. 60 This situation is replicated in UK-based studies. In UK Biobank, only 2% of participants are from Asian or Asian British ethnic groups, despite this group representing around 7% of the UK population. It is essential that representation of minority ethnic groups increases substantially in genomic studies if these communities are to realise the benefits of genomically-informed advances in precision medicine. EXCEED aims to contribute towards this important goal.
The utility of combining EHR and genetic data for efficient and flexible genetic studies has been highlighted by the eMERGE network of biobanks and Geisinger MyCode. 61,62 The comprehensive nature and nearuniversal coverage of NHS health records adds further strength to this study design. In particular, the ability to capture virtually all primary and secondary care contacts over decades of the lifespan enables longitudinal studies with a depth of data available in relatively few studies.
Strengths of the study also include consent from all participants to be contacted to participate in recallby-genotype studies, a type of consent which is not yet widely sought in cohort studies. Recall-by-genotype studies are expected to be highly valuable to identify and validate drug targets and to inform targeting of therapeutics in a precision medicine approach. 48 Maintaining the engagement of cohort participants is important for such studies. Potential strategies include incentivizing and/or reducing barriers to further participation, building a study 'community' through study branding, newsletters and events, and efforts to trace participants where contact details have lapsed. 63,64 In EXCEED, in addition to a study newsletter, we are devising approaches with our patient and public involvement (PPI) group, including planned focus groups on dynamic consent approaches. 65 Some studies incorporating genetic analyses (such as Genomics England) actively seek clinically actionable variants, whereas most cohort studies may not seek to identify these but may discover them as incidental findings. Anticipating this potential, at the time of consent we asked whether participants would wish to be notified about clinically actionable variants; 99.5% of participants stated that they would wish to be informed in this situation. Clinically actionable variants will be discussed with the regional clinical genetics department of University Hospitals Leicester NHS Trust and then reported back to participants on request for NHS validation. Understanding the reasons for participants' preferences, how these change over time and how these can best be supported by future policies and procedures, will be of key importance for EXCEED and other longitudinal cohort studies.
Can I get hold of the data? Where can I find out more?
Participants have consented to their pseudonymized data being made available to other approved researchers, and we welcome requests for collaboration and data access. Access to the resource requires completion of a proposal form, including a lay summary of the proposed research. Applications to access the resource will be assessed for consistency with the data access policy and with the guidance of the Scientific Committee, which has participant representation. Access to the data will be subject to completion of an appropriate Data/Materials Transfer Agreement and to necessary funding being in place. Requests to collect new data or to use biological samples may be subject to additional requirements. Interested researchers are encouraged to contact the study management team via exceed@le.ac.uk.