Cohort Profile: East London Genes & Health (ELGH), a community based population genomics and health study in people of British-Bangladeshi and -Pakistani heritage

East London Genes & Health (ELGH) is a large scale, community genomics and health study (to date >30,000 volunteers; target 100,000 volunteers). ELGH was set up in 2015 to gain deeper understanding of health and disease, and underlying genetic influences, in people of British-Bangladeshi and -Pakistani heritage living in east London. ELGH prioritises studies in areas important to, and identified by, the community it represents. Current priorities include cardiometabolic diseases and mental illness, these being of notably high prevalence and severity. However, studies in any scientific area are possible, subject to community advisory group and ethical approval. ELGH combines health data science (using linked UK National Health Service (NHS) electronic health record data) with exome sequencing and SNP array genotyping to elucidate the genetic influence on health and disease, including the contribution of high rates of parental relatedness to rare genetic variation and homozygosity (autozygosity), in two understudied ethnic groups. Linkage to longitudinal health record data enables both retrospective and prospective analyses. Through stage 2 studies, ELGH offers researchers the opportunity to undertake recall-by-genotype and/or recall-by-phenotype studies on volunteers. Sub-cohort, trial-within-cohort, and other study designs are possible. ELGH is a fully collaborative, open access resource, open to academic and life sciences industry scientific research partners.



Why was the cohort set up?
East London Genes & Health (ELGH) commenced recruitment in April 2015 as a community-based, long-term study of population health and disease in people of British-Bangladeshi and -Pakistani heritage in east London. ELGH uses a novel population-based design that combines cutting-edge genomics with high-quality electronic health record data linkage and targeted genotype-based recall studies, currently in >30,000 volunteers, with funding to expand to 100,000 volunteers by 2023. ELGH is an open access data resource, building on the expertise of studies including UK Biobank, and has been designed to generate new knowledge related to the health and disease of a population at high need, and to redress the poor representation of non-White ethnic groups in existing population genomic cohorts 1.
Almost a quarter of the world's population is of South Asian origin with over 3 million in the UK, representing 5% of the UK population 2 . The risk of coronary heart disease is 3-4 times higher, and type 2 diabetes 2-4 times higher in UK South Asians compared with Europeans 3,4 . Understanding the mechanisms underlying these ethnic differences will provide important insights into the aetiology of cardiometabolic diseases to inform new approaches to treatment and prevention, and help to reduce ethnic inequalities.
The setting for ELGH, in east London, incorporates one of the UK's largest South Asian communities (29% of a total population of 1.95 million people across its 8 local authorities), of which 70% are people of British-Bangladeshi and -Pakistani heritage. This population lives in high levels of deprivation (Tower Hamlets, Hackney, and Barking and Dagenham are the 9th, 10th and 11th most deprived local authorities in England) 5 and experiences disproportionately adverse health outcomes, especially relating to cardiometabolic health and its complications. Compared to White Europeans, South Asians living in east London have a two-fold greater risk of developing type 2 diabetes (16.4% vs. 7.5%) 6, faster progression of chronic kidney disease in those with diabetes 7, nearly double the risk of non-alcoholic liver disease 8, and over double the risk of multimorbidity including cardiovascular disease 9. The onset of cardiovascular disease also occurs 8 years earlier in men (60.4 years compared to 68.2 years) 9. Determinants of poor cardiometabolic health are also noted to start early in the life course, with 35-44% of 10-11 year old children overweight/obese in east London boroughs, above the UK average of 33% 5. Deprivation, ethnicity-related ill-health, and inequalities all combine to lead to worse health outcomes.
A key feature of ELGH is the opportunity to obtain, and link, high quality individual level data from routine, real world, longitudinal (retrospective and prospective) clinical data sources to genomic data, with the ability to recall volunteers for further research studies. East London has an extensive track record of utilising routine clinical health care data (predominantly from primary care) in research studies 6-8. Routinely collected health record data are of high quality, and electronic performance dashboards are embedded (and sometimes incentivised) in UK clinical practice, facilitating both high quality clinical care and disease monitoring 10. However, clinical data differ from traditional epidemiological researcher-collected (or participant-recorded) data, and present different challenges when used to measure health outcomes. These challenges are being addressed both internationally (e.g. the SNOMED collaboration) and within the UK (e.g. by the establishment of the new Health Data Research UK). Population-based disease risk screening programmes, such as the National Health Service (NHS) Health Checks, are widely taken up in east London and, notably, show greatest uptake in people of South Asian ethnicity and from the most deprived quintiles 11.
ELGH volunteers report high rates of parental relatedness (e.g. are offspring of first cousin parent marriages, Table 1), which leads to genomic regions of autozygosity at the DNA level, such that rare allele frequency variants normally seen as heterozygotes can be observed in the homozygous (autozygous) state. Whilst the effects of autozygosity are well studied in paediatrics and rare disease, there is much less previous research on any health effects 12,13 in adults recruited from population settings.
ELGH supports the health needs and priorities of the local community, and fosters authentic, long term engagement in its research, so that benefits to health can be delivered in the future. A community advisory group is embedded into the high level strategic management of ELGH and has helped prioritise areas for research, including type 2 diabetes, cardiovascular disease, dementia and mental health. The community focused ethos extends to a wide range of meaningful public engagement and dissemination activities, including collaboration with the award-winning Centre of the Cell science and health education facility 14 .

Who is in the cohort?
ELGH (see Figure 1) incorporates population-wide recruitment to Stage 1 studies, and targeted recruitment to [...]
Summary data from the stage 1 volunteer questionnaire and electronic health record data linkage are presented in Table 1, and include both baseline characteristics and data captured from longitudinal electronic health records. Basic demographics (age group and sex) of ELGH volunteers are compared to population-wide data in Figure 2. This comparison shows that the 'convenience sampling' approach to volunteer recruitment in ELGH is obtaining a sample that is broadly representative of the background population with regard to age and sex, but which modestly favours recruitment of women over men in those <45 years. Data in Table 1 also indicate that ELGH volunteers live in areas of high deprivation (97% live in the most deprived 2 quintiles, using the Index of Multiple Deprivation). ELGH volunteers have a high proportion of common medical conditions, including type 2 diabetes (22%), hypertension (17%), ischaemic heart disease (5%) and asthma (11%), reflective of their prevalence in the background population (5%, 9%, 2%, 5%, respectively) but also with evident over-sampling of people with chronic [...]

How often have they been followed up?
ELGH contains real-world data with data collection triggered by a broad range of clinical encounters, including routine health checks, chronic disease management, inpatient hospital admissions, surgery, maternity care and emergency care.
Primary health care records in east London were digitised around 2000 and offer a rich source of data on clinical encounters since then, but also including the dates of diagnoses pre-digitisation (e.g. type 2 diabetes, diagnosed in 1992) and summarised prior clinical events.
Health data extraction and linkage take place every 3 months, and ELGH volunteers have consented to lifelong access to electronic (and paper) health records (primary care, secondary care, community and mental health, national NHS datasets and registries), facilitating long term prospective follow-up.

What has been measured?
Stage 1 incorporates the following procedures and data collection (summarised in Table 2 ):

• Participant questionnaire
Participant stage 1 questionnaire (see supplementary file 1 ): This self-report questionnaire collects brief data on all participants including: name, date of birth, sex, ethnicity, contact details, diabetes status, parental relatedness (consanguinity, manifest as autozygosity at the genomic DNA level), and an overall assessment of general health and wellbeing. This short questionnaire has been deliberately designed and minimised to facilitate high throughput recruitment and inclusivity of groups where language and cultural differences exist, and to be used with or without researcher assistance, thereby maximising the representativeness of our population sample.
Health record data is obtained by linkage to a participant's NHS number (available for >99% of volunteers), either recorded at the time of recruitment, or at later look-up, and including an NHS number validation step (a check digit).
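The check-digit step mentioned above can be sketched as follows. This is a minimal illustration that assumes the standard Modulus 11 check used for 10-digit NHS numbers; the function name is hypothetical and is not taken from the ELGH pipeline.

```python
def is_valid_nhs_number(nhs_number: str) -> bool:
    """Return True if a 10-digit NHS number passes the Modulus 11 check.

    Assumption: spaces and hyphens are formatting only and are ignored.
    """
    chars = [c for c in nhs_number if not c.isspace() and c != "-"]
    if len(chars) != 10 or not all(c in "0123456789" for c in chars):
        return False
    digits = [int(c) for c in chars]

    # Weight the first nine digits by 10, 9, ..., 2 and sum the products.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))

    # The check digit is 11 minus the remainder; 11 maps to 0, 10 is never valid.
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        return False
    return check == digits[9]
```

A number that fails this check would typically be sent for manual look-up rather than linked automatically.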

• NHS primary care health record data linkage
A data extraction template is used to extract relevant fields from electronic health record systems (both primary and secondary care data sources are accessible). Whilst raw data are potentially available, at present we are restricting analyses to directly curated research phenotypes and growing these both incrementally and on demand. The data extraction template comprises bespoke search terms, including SNOMED, ICD10 and READ (including READ2 and CTV3) diagnostic codes, prescribing data, laboratory test results and clinical measurements and processes (an example relating to type 2 diabetes is given in supplementary file 2). Electronic health record data are of high quality, and are particularly rich in settings where routine data collection is standardised and incentivised, such as by the Quality and Outcomes Framework used in NHS primary care. All electronic health record data are cleaned and checked prior to analysis. Data concordance was checked between participant questionnaire and electronic health record, with >99% concordance for gender and year of birth. Technical errors with optical character recognition (dates of birth) or user data completion (e.g. a Mr with a male first name ticking female) explained almost all cases of discordance, and were resolved with manual checking in the final analysis datasets. Data outside clinically plausible ranges (e.g. primary care-measured systolic blood pressure <60mmHg or >250mmHg, diastolic blood pressure <30mmHg or >200mmHg), or with clear data entry errors (e.g. height recorded as 167 metres instead of 1.67 metres), are removed. For the purposes of this data summary, anthropometric measurements recorded historically in the electronic health records were only used if the volunteer was >16 years old at the time of measurement. Missing data exist, but at relatively low frequency in routinely collected and incentivised clinical measures, e.g. smoking status has been recorded in the primary care record of 90% of ELGH volunteers in the 5 years prior to the most recent data linkage. Repeated measures of routinely collected data, and cross-validation across information sources, can mitigate the impact of missing data where it exists, and statistical techniques, such as sensitivity analysis (for missing-not-at-random data) and multiple imputation (for missing-at-random data), will be required in data analysis 16.
Electronic health record data represent a 'living' dataset allowing retrospective, cross-sectional and prospective data collection.
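The blood pressure range checks described in this section can be sketched as a simple filter. This is an illustration only: the thresholds are those quoted in the text, while the class and function names are hypothetical and the treatment of boundary values as plausible is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BloodPressure:
    systolic_mmhg: float
    diastolic_mmhg: float


# Plausible ranges quoted in the text (values outside them are removed).
SYSTOLIC_RANGE = (60, 250)
DIASTOLIC_RANGE = (30, 200)


def is_plausible(bp: BloodPressure) -> bool:
    """True if both components lie within the clinically plausible ranges."""
    return (SYSTOLIC_RANGE[0] <= bp.systolic_mmhg <= SYSTOLIC_RANGE[1]
            and DIASTOLIC_RANGE[0] <= bp.diastolic_mmhg <= DIASTOLIC_RANGE[1])


def remove_implausible(readings: List[BloodPressure]) -> List[BloodPressure]:
    """Drop readings outside the plausible ranges, keeping the original order."""
    return [bp for bp in readings if is_plausible(bp)]
```

Readings are dropped rather than corrected here, matching the text's statement that such data are removed.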

• NHS secondary care health record data linkage
Barts Health NHS Trust is the major network of secondary care hospitals across east London, and is the UK's largest NHS Trust. Currently, NHS numbers of ELGH volunteers are linked to the Barts Health data warehouse, containing clinician-coded SNOMED acute and chronic problem lists, laboratory results, pathology results, imaging results, and ICD-10 clinical coding, which is used at every finished episode of care. Data are available for all ELGH volunteers who have attended the Barts Health hospital system; at the last linkage, this included 8720 ELGH volunteers. As an exemplar, maternity data linkage within Barts Health identified 2402 female ELGH volunteers with maternity records available for at least one pregnancy (2972 single live births, 27 twin/multiple live births and 17 stillbirths). We intend to expand secondary care data linkage to other local east London NHS Trusts in 2019.

• Planned linkage to health record datasets
In 2019, East London Genes & Health will link to further datasets, including: [...]
Other potential data linkages in the future include national cancer datasets (National Cancer Registration and Analysis Service, NCRAS) and cardiovascular disease audits managed by the National Institute for Cardiovascular Outcomes Research (NICOR).

• Genomics
DNA is extracted from saliva collected with the Oragene (DNA Genotek) system, and is stored for all Stage 1 participants.
To date, low/mid depth exome sequencing has been performed (n=3781, data available) or is in progress (n=1492) on those participants reporting parental relatedness in the participant questionnaire.
In late 2018/2019 (funding secured), 50,000 samples from all stage 1 volunteers will be genotyped on the Illumina Infinium Global Screening Array v2.0 (with an additional 46,662 Consortia-defined multi-disease variants) 19. Array content includes variants selected for rare disease mutations, from large exome sequencing projects, for pharmacogenomics and for genome wide coverage, enabling association studies, polygenic risk scores and Mendelian randomisation studies.
In 2019/2020, if support is secured from an evolving Life Sciences Industry Consortium, high-depth exome sequencing will be performed on up to 50,000 volunteer samples.
By 2023, the intention (subject to funding) is for both genotyping and high-depth exome sequencing to be performed on up to 100,000 volunteer samples.

• Samples for other -omics
ELGH takes "core" study samples from all volunteers recalled for stage 2 or later stage studies, including a blood cell pellet (for DNA, protein, and other assays), plasma aliquots, and a blood cell RNA preservation tube (Paxgene) to enable further studies including methylation assays, transcriptomics, proteomics, lipidomics and metabolomics.

What has it found: key findings and publications?
East London Genes and Health is a new resource that continues to grow, and to date has been used for three main areas of work:

• Characterisation of common phenotypes
Using type 2 diabetes as an exemplar condition, we show the feasibility of the ELGH study design to generate high quality electronic health record data for phenotypic characterisation of volunteers (Table 3). Of 19165 participants in ELGH with available linked electronic health record data, 4312 (22%) have a diagnosis of type 2 diabetes (T2D) in their primary care record. Basic sociodemographic data (age, gender, ethnicity) were recorded for 100% of these participants, and smoking status had been recorded within 2 years of the most recent data linkage (in 2018) for 96%. Country of birth was incompletely recorded, but at least half of ELGH participants with T2D were born in Bangladesh or Pakistan. The real world clinical data recorded are of high quality, with body mass index, markers of glucose control (HbA1c) and serum cholesterol measured within 2 years prior to participating in ELGH in at least 96% of participants with T2D. The high uptake of routine care processes, and the high quality data capture from these, show the potential for ELGH to study participants in cross-section at study entry from electronic health record data. Hypertension, ischaemic heart disease and chronic kidney disease were observed in 47%, 15% and 10% of the 4312, respectively, and erectile dysfunction was present in 26% of men. Retinal complications of T2D are recorded and graded in the electronic health record, with 35% of ELGH participants having retinopathy and/or maculopathy in screening undergone within the last 2 years. Prescribing data are available on all ELGH volunteers with T2D, showing recent insulin prescriptions in 16%, and the use of single or multiple non-insulin agents as well as drugs for the prevention of cardiovascular disease (e.g. lipid lowering therapy).
Data summarised from ELGH participants with T2D also show the potential for ELGH to study phenotypic traits longitudinally, both retrospectively and prospectively. The median duration of T2D in ELGH participants was 9 years (range 0-50 years), with electronic health record data available during this time. All volunteers with T2D had a year of onset of the condition recorded, and clinical measurements (including body mass index, HbA1c and serum cholesterol) at the time of diagnosis (+/-6 months) were available for nearly two-thirds of participants. Historic prescribing data were available for similar proportions of participants (data not shown). A diagnosis of pre-diabetes had been made prior to diagnosis of T2D in 23% (993) of these individuals, and 16% (350) of women had a prior diagnosis of gestational diabetes, with clinical data available at these times, highlighting the potential to obtain longitudinal data to inform prevalence and progression of disease states within ELGH volunteers.
Multimorbidity is an increasing problem in populations with high rates of chronic long-term conditions and ageing. In the ELGH population we identified a high rate of cardiovascular multimorbidities (including hypertension, stroke, ischaemic heart disease, heart failure, atrial fibrillation, chronic kidney disease stage 3+, advanced diabetic retinopathy) associated with type 2 diabetes. Only 14% of ELGH volunteers with type 2 diabetes (n=4312) had this as a single condition; 30% had 2 cardiovascular multimorbidities, 27% had 3, and 29% had 4 or more (Table 3).
The high quality of the electronic health record data available reflects the robust clinical care systems and incentivised data collection methods 10 used in east London. These data, and the consent procedures facilitating lifelong access, will provide an invaluable longitudinal data resource to facilitate genomic studies (e.g. linking phenotype to rare gene variants in Stage 2 recall studies), future at-scale population studies of common phenotypes (Stage 3) and intervention studies (Stage 4).

• Rare allele frequency gene variants occurring as homozygotes, including predicted loss of function knockouts
The British-Bangladeshi and -Pakistani populations of east London have high rates of parental relatedness (ELGH volunteers self-report ~20%). All those volunteers self-reporting parental relatedness have been selected for exome sequencing. Genomic autozygosity (homozygous regions of the genome identical by descent from a recent common ancestor) means that rare allele frequency variants normally only seen as heterozygotes are enriched for homozygote genotypes. We and others previously investigated the health and population effects of such variants, with a focus on predicted protein loss of function variants 12,20,21 , in smaller samples of Pakistani ethnicity, and we now expand the datasets with the ELGH study.
To inform analyses using self-reported parental relatedness, we compared this self-reported trait against actual autozygosity measured at the DNA level by exome sequencing (Figure 3). We find that self-reported parental relationship is only a modest predictor of actual autozygosity; for example, 8.2% of individuals who declare that their parents are not related in fact have >2.5% genomic autozygosity. For British-Bangladeshi subjects, mean autozygosity is slightly lower than expected given the reported parental relationship (possibly due to confusion over the meaning of e.g. "first cousin" versus "second cousin"), whereas for British-Pakistani subjects mean autozygosity is slightly higher than expected (possibly due to historical parental relatedness).
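For orientation, the "expected" autozygosity for a reported cousin relationship can be computed from standard pedigree arithmetic. The sketch below assumes full cousins sharing exactly two common ancestors and no additional background relatedness; it illustrates the textbook expectation (1/16 for offspring of first cousins, 1/64 for second cousins) and is not the estimator used in this study. The function names are hypothetical; the 2.5% threshold is the one quoted in the text.

```python
from fractions import Fraction

# Threshold quoted in the text for flagging appreciable autozygosity.
AUTOZYGOSITY_THRESHOLD = 0.025


def expected_autozygosity(cousin_degree: int) -> Fraction:
    """Expected autozygous genome fraction for offspring of full cousins.

    cousin_degree = 1 for first cousins, 2 for second cousins, and so on.
    Each of the two shared ancestors contributes (1/2)^(2d + 3), so the
    total is (1/2)^(2d + 2).
    """
    if cousin_degree < 1:
        raise ValueError("cousin_degree must be 1 or greater")
    return Fraction(1, 2) ** (2 * cousin_degree + 2)


def is_discordant(reported_related: bool, measured_autozygosity: float) -> bool:
    """Flag volunteers who report unrelated parents but exceed the threshold."""
    return (not reported_related) and measured_autozygosity > AUTOZYGOSITY_THRESHOLD
```

Under these assumptions, a volunteer whose parents are second cousins (expected 1/64, about 1.6%) falls below the 2.5% threshold, whereas one whose parents are first cousins (1/16, 6.25%) lies well above it.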

• Recall by genotype (and/or phenotype) studies
Recall-by-genotype (RbG) studies, applied to population cohorts with genomic data, are of increasing interest to [...] RbG studies can be based on genotype groups at a single variant (or an allelic series for a gene), but also permit polygenic variant designs (e.g. extremes of polygenic risk scores [...]
ELGH is not a traditionally designed epidemiological cohort with deep data collection at recruitment (such as would be expected in a birth cohort), but does reflect an increasing trend towards pragmatic, 21st century health data-driven population study design 23. The ability to invite all Stage 1 participants to recall studies opens the possibility to develop sub-cohorts (including collection of research grade data, as well as routine clinical care data), trials-within-cohorts and other innovative study designs in the future.
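The polygenic recall design mentioned above (recalling volunteers at the extremes of a polygenic risk score) could be sketched as follows. This is a hypothetical illustration: the function name, the input format (a mapping from volunteer identifier to score) and the default 5% tail size are all assumptions, not ELGH procedures.

```python
from typing import Dict, List, Tuple


def select_score_extremes(scores: Dict[str, float],
                          tail_fraction: float = 0.05) -> Tuple[List[str], List[str]]:
    """Return (lowest, highest) volunteer IDs by polygenic risk score.

    tail_fraction is the share of the cohort taken from each tail; the
    tail size is rounded to a whole number of volunteers, minimum one.
    """
    if not 0 < tail_fraction <= 0.5:
        raise ValueError("tail_fraction must be in (0, 0.5]")
    ranked = sorted(scores, key=scores.get)  # ascending by score
    n_tail = max(1, round(len(ranked) * tail_fraction))
    return ranked[:n_tail], ranked[-n_tail:]
```

In practice the two groups would usually also be matched on covariates such as age and sex before invitation; that step is omitted here.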

Can I get hold of the data? Where can I find out more?
External researchers are invited to participate in ELGH through the use of data generated in Stage 1 and Stage 2, as well as through the design of bespoke Stage 2 studies targeting gene variants and/or phenotypes of interest. ELGH offers an open-access resource to researchers, whether national or international, academic or industrial.
Data access is managed at several levels depending on the sensitivity and identifiable nature of the data:
• Level 1 - fully open data. We distribute summary level data via our website, e.g. current summary genotype counts and annotation of knockout variants from exome sequencing of 3,782 volunteers (www.genesandhealth.org/research/scientific-data-downloads), to date downloaded by >100 users.
• Level 2 - Genotype data (from SNP chip genotyping, or high throughput sequencing) is (or will be made) available under Data Access Agreement. Individual sequencing (e.g. cram) and genotype files (e.g. vcf) are available within 6 months on the European Genome-phenome Archive 24 (EGA). Access approval is granted by the Wellcome Sanger Institute Data Access Committee, who are independent of ELGH investigators.
• Level 3 - Individual-level phenotype data is held in an ISO27001 and NHS Information Governance compliant Data Safe Haven environment under Data Access Agreement, which also contains the latest genetic data linked to the questionnaire and health record phenotypes. This "bring researchers to the data" model allows us to present the most recent data to researchers, update data, and maintain complex linkages between multiple datasets easily, and avoids multiple large file data transfers for genomic datasets. This model also permits us to reassure volunteers that their sensitive health data will be carefully looked after, in particular by providing maximum security against large data breaches (e.g. as experienced by Facebook and British Airways in 2018). External data export is controlled, and individual level data export will not be allowed without very good reason. The current data safe haven is the UK Secure e-Research Platform 25, hosted by Swansea University, based on the SAIL databank 26, and supporting Dementias Platform UK amongst other UK cohort studies.
ELGH also supports research studies recalling volunteers by genotype or phenotype (local, external and industry). Two RbG studies, both led by non-ELGH academic researchers, are underway. The first stage 2 RbG study led by a life sciences industry partner is about to commence. External researchers and consortia are able to apply to undertake research with East London Genes & Health via a formal application process, the details of which are available on its website. Applications are assessed by both the Executive Board and Community Advisory Group, according to