Cohort Profile: East London Genes & Health (ELGH), a community-based population genomics and health study in British Bangladeshi and British Pakistani people

Sarah Finer, Hilary C Martin, Ahsan Khan, Karen A Hunt, Beverley MacLaughlin, Zaheer Ahmed, Richard Ashcroft, Ceri Durham, Daniel G MacArthur, Mark I McCarthy, John Robson, Bhavi Trivedi, Chris Griffiths, John Wright, Richard C Trembath and David A van Heel*

Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, UK, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK, London Borough of Waltham Forest, Waltham Forest Town Hall, Walthamstow, UK, Department of Law, Queen Mary University of London, London, UK, Social Action for Health, London, UK, Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA, Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA, Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK, Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital, Oxford, UK, Oxford NIHR Biomedical Research Centre, Churchill Hospital, Oxford, UK, Bradford Institute for Health Research, Bradford Teaching Hospitals National Health Service (NHS) Foundation Trust, Bradford, UK and School of Basic and Medical Biosciences, Faculty of Life Sciences and Medicine, King’s College London, London, UK

Hackney, Barking and Dagenham are the 9th, 10th and 11th most deprived local authorities in England). 5 Compared with White British people, British South Asians living in east London have a 2-fold greater risk of developing T2D, 6 nearly double the risk of non-alcoholic liver disease 7 (many volunteers are practising Muslims and do not drink alcohol) and over double the risk of multimorbidity, 8 with the onset of cardiovascular disease occurring 8 years earlier in men. 8 Determinants of poor cardiometabolic health start early in the life course, and east London rates of overweight and obese children are among the highest in the UK.
Recent genomic advances offer exciting potential to better understand the genetic causation of disease, 9 including rare loss-of-function gene variants. 10 Genetic variation relevant to British Bangladeshi and British Pakistani populations, such as autozygosity arising from parental relatedness, is under-researched with regard to potential effects on complex adult phenotypes at a population level. 11,12 ELGH fosters authentic, inclusive, long-term engagement in its research, to deliver future health benefits to the population it represents. Community involvement in ELGH helps prioritize areas for research, including T2D, cardiovascular disease, dementia and mental health. ELGH undertakes a range of public engagement work, including collaboration with the award-winning Centre of the Cell. 13

Who is in the cohort?
ELGH (see Figure 1) incorporates population-wide recruitment to Stage 1 studies, and targeted recruitment to Stage 2 recall-by-genotype (RbG) studies. Stage 3 and 4 studies are planned.
During Stage 1, ELGH invites voluntary participation of all British Bangladeshi and British Pakistani individuals aged 16 and over, living in, working in or within reach of east London. Recruitment is largely undertaken by bilingual health researchers, and takes place in: (i) community settings, e.g. mosques, markets and libraries, supported by a third-sector partner organization (Social Action for Health); and (ii) health care settings, e.g. GP surgeries, outpatient clinics. Stage

Summary data from the Stage 1 volunteer questionnaire and EHR data linkage are presented in Table 1, including both baseline and longitudinal health data. Basic demographics of ELGH volunteers are compared with population-wide data in Figure 2, and highlight that the convenience sampling approach in Stage 1 recruitment has achieved a sample broadly representative of the background population with regard to age and sex, but which modestly favours recruitment of women over men in those aged <45 years. ELGH volunteers live in areas of high deprivation (97% in the most deprived two quintiles of the Index of Multiple Deprivation).

How often have they been followed up?
ELGH contains real-world EHR data, its collection triggered by a broad range of clinical encounters including routine and emergency care. East London has an extensive track record of using routine clinical health care data (predominantly from primary care) in research studies. 6,7,14 Electronic performance dashboards are embedded in clinical practice, facilitating high quality and equitable disease screening and clinical care. 15,16 Primary care health records were digitized around 2000 and offer a rich source of data on clinical encounters since then, but also include pre-digitization dates of diagnoses and summarized clinical events (e.g. type 2 diabetes, diagnosed in 1992). Health data linkage and extraction takes place 3-monthly and ELGH volunteers have consented for lifelong EHR access, facilitating longitudinal follow-up.
ELGH can invite volunteers to Stage 2 studies up to four times per year for more detailed study visits, e.g. recall by genotype (RbG) and/or phenotype, for clinical assessment and collection of biological samples, subject to ethics approval, volunteer acceptability and community advisory group approval. As at August 2019, around 60 ELGH volunteers have participated in Stage 2 RbG studies.

What has been measured?
Available data are summarized in Table 2.
• Volunteer questionnaire (Supplementary File 1, available as Supplementary data at IJE online). This self-report questionnaire collects brief data including: name, date of birth, sex, ethnicity, contact details, diabetes status, parental relatedness and overall assessment of general health and well-being. The questionnaire has been designed to facilitate high throughput recruitment and volunteer inclusivity where language and cultural differences may exist, and to be used with or without researcher assistance. The questionnaire does not capture environmental factors (e.g. no self-reported data on smoking, alcohol, diet, physical activity, although smoking and alcohol use are available from other data sources, discussed below). Completion of the volunteer

Data concordance was checked between volunteer questionnaires and their EHR, with >99% concordance for gender and year of birth. Almost all cases of data discordance were due to technical errors with questionnaire optical character recognition or user data completion, and were resolved with manual checking. Data outside clinically plausible ranges, or with clear data entry errors, are removed. A detailed description of our data processing is in Supplementary File 3, available as Supplementary data at IJE online. Missing data exist, but at relatively low frequency in routinely collected and incentivized clinical measures, e.g. smoking status is recorded in the EHR of 88% of volunteers in the 5 years preceding the most recent data linkage. Repeated measures of routinely collected data and cross-validation across information sources can mitigate the impact of missing data where it exists, as can statistical techniques, such as sensitivity analysis and multiple imputation. 17

• NHS local secondary care health record data linkage.
Linkage  46 662 Multi-Disease variants). 20 Array content includes rare disease-associated mutations (e.g. all pathogenic and likely pathogenic variants in ClinVar), pharmacogenetic associations and genome-wide coverage for association studies (based on the 26 populations present in Phase III of 1000 Genomes Project, optimized for imputation accuracy), polygenic risk score and Mendelian randomization studies.
In 2019/2020, if support is secured from an evolving Life Sciences Industry Consortium or elsewhere, high-depth exome sequencing will be performed on up to 50 000 volunteer samples. The intention is for genotyping

What has it found? Key findings and publications
ELGH is a new resource that continues to grow in size and content and, to date, has been used for three main areas of work, as follows.

Characterization of common phenotypes
Using Type 2 diabetes (T2D) as an exemplar, we show the ability for detailed phenotypic characterization of ELGH volunteers using EHRs (Table 3). Of 21 514 volunteers in ELGH with available linked EHR data, 4769 (22%) have a diagnosis of T2D in their primary care record. Basic sociodemographic data (age, gender, ethnicity) of volunteers were recorded in 100%, and smoking status had been obtained within 2 years of the most recent data linkage in 94%. In over 97% of volunteers with T2D, body mass index, markers of glucose control (HbA1c) and serum cholesterol were measured and available in the 2 years preceding ELGH participation. Hypertension, ischaemic heart disease and chronic kidney disease were observed in 47%, 15% and 11%, respectively, of the 4769, and erectile dysfunction was present in 26% of men. Retinal complications of T2D are recorded and graded, with 82% of volunteers having undergone screening within the past 2 years. Prescribing data show recent insulin prescriptions in 16%, and the use of single or multiple non-insulin agents, as well as use of cardiovascular drugs (e.g. lipid-lowering therapy). These data show the potential to perform cross-sectional analyses in ELGH from EHR data. EHR data also give the potential to study longitudinal phenotypic traits, retrospectively and prospectively. Median duration of T2D in ELGH volunteers was 7 years (range 0-51 years). For all volunteers with T2D, year of onset was recorded, and prescribing data and clinical measurements (including body mass index, HbA1c and cholesterol) at the time of diagnosis (±6 months) were available for nearly two-thirds of volunteers. Before T2D onset, 26% (993) had had a diagnosis of pre-diabetes and 16% (370) of women had had a diagnosis of gestational diabetes, allowing study of progression from at-risk to disease states.
Multimorbidity is an increasing problem in ageing populations with high rates of chronic long-term disease; in the ELGH population we identified that of the 4769 ELGH volunteers with T2D, 80% had at least one, and 27% had two or more cardiovascular multimorbidities (Table 3).

Rare allele frequency gene variants occurring as homozygotes, including predicted loss-of-function knockouts
All ELGH volunteers self-reporting parental relatedness (19%) have been selected for exome sequencing. Genomic autozygosity (homozygous regions of the genome identical by descent from a recent common ancestor) means that rare allele frequency (minor allele frequency <0.5%) variants normally only seen as heterozygotes are enriched for homozygote genotypes. ELGH expands existing, smaller studies of autozygosity to investigate the health and population effects of such variants, with a focus on loss of function variants. 11,21,22 Self-reported parental relatedness is a modest predictor of actual autozygosity measured at the DNA level by exome sequencing (Figure 3), e.g. we find that 8.2% of individuals who declare that their parents are not related in fact have >2.5% genomic autozygosity. For British Bangladeshi volunteers, mean autozygosity is slightly lower than expected given the reported parental relationship (possibly due to confusion over the meaning of e.g. 'first cousin' versus 'second cousin'), whereas for British Pakistani volunteers, mean autozygosity is slightly higher than expected (possibly due to historical parental relatedness). With an ELGH sample size of 100 000 we estimate we will identify rare variant-predicted loss-of-function homozygotes in >5000 human genes. ELGH plans to work with other studies on an international human knockout variant browser.
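The autozygosity "expected given the reported parental relationship" can be illustrated with the standard pedigree path-counting formula, F = Σ (1/2)^(n1 + n2 + 1), summed over the ancestors shared by the two parents. The Python sketch below is illustrative only and is not the ELGH analysis pipeline: the function names and relationship labels are hypothetical, and it assumes the shared ancestors are not themselves inbred. The 2.5% threshold is the one quoted in the text.

```python
from fractions import Fraction
from typing import Dict, List, Tuple


def inbreeding_coefficient(paths: List[Tuple[int, int]]) -> Fraction:
    """Expected autozygosity (F) of a child from pedigree paths.

    Each (n1, n2) pair gives the number of generations from one parent and
    from the other parent back to a shared ancestor. Assumes (for
    illustration) that shared ancestors are not themselves inbred.
    """
    return sum(
        (Fraction(1, 2) ** (n1 + n2 + 1) for n1, n2 in paths),
        Fraction(0),
    )


# Hypothetical labels for self-reported parental relationships.
PARENTAL_RELATIONSHIPS: Dict[str, List[Tuple[int, int]]] = {
    "unrelated": [],
    "first_cousins": [(2, 2), (2, 2)],   # two shared grandparents
    "second_cousins": [(3, 3), (3, 3)],  # two shared great-grandparents
}


def exceeds_threshold(observed_fraction: float, threshold: float = 0.025) -> bool:
    """True if measured genomic autozygosity is above the given threshold."""
    return observed_fraction > threshold


if __name__ == "__main__":
    for label, paths in PARENTAL_RELATIONSHIPS.items():
        expected = inbreeding_coefficient(paths)
        print(f"{label}: expected F = {expected} ({float(expected):.4%})")
```

Under these assumptions the expected values are 1/16 (6.25%) for children of first cousins and 1/64 (about 1.56%) for children of second cousins, so a volunteer reporting unrelated parents but measuring above 2.5% sits above the second-cousin expectation.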

Recall by genotype (and/or phenotype) studies
RbG studies, applied to population cohorts with genomic data, are of increasing research interest 23 and use the random allocation of alleles at conception (Mendelian randomization) to aid causal inference in population studies, reduce biases seen with observational studies and develop functional studies. RbG studies can target specific single variants (or an allelic series for a gene) and polygenic variants (e.g. extremes of polygenic risk scores).
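As a minimal sketch of how a recall list targeting "extremes of polygenic risk scores" might be drawn, the Python example below ranks volunteers by a precomputed score and returns the lowest and highest tails. It is an assumption-laden illustration, not ELGH's selection procedure: the function name, the `tail_fraction` parameter and the volunteer identifiers are hypothetical, and real recall would also be subject to the ethics and community approvals described elsewhere in this profile.

```python
from typing import Dict, List, Tuple


def select_prs_extremes(
    scores: Dict[str, float], tail_fraction: float = 0.05
) -> Tuple[List[str], List[str]]:
    """Return (lowest, highest) volunteer IDs by polygenic risk score.

    `scores` maps a hypothetical volunteer ID to a precomputed score.
    `tail_fraction` is the share of volunteers taken from each tail;
    at least one volunteer is returned per tail.
    """
    if not scores:
        return [], []
    ranked = sorted(scores, key=lambda vid: scores[vid])  # ascending by score
    k = max(1, int(len(ranked) * tail_fraction))
    return ranked[:k], ranked[-k:]


if __name__ == "__main__":
    # Toy data only: 20 made-up volunteers with scores 0.0 to 19.0.
    toy_scores = {f"vol{i:02d}": float(i) for i in range(20)}
    low, high = select_prs_extremes(toy_scores, tail_fraction=0.1)
    print("low tail:", low)
    print("high tail:", high)
```

Selecting on genotype-derived scores rather than on measured phenotype is what lets such a design draw on the random allocation of alleles at conception; the tail size trades statistical contrast against the number of volunteers available for recall.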
ELGH is undertaking RbG studies in Stage 2 using bespoke clinical phenotyping tailored to the genotype or phenotype of interest. To date, three research consortia have undertaken ELGH RbG studies, one recalling volunteers with loss-of-function gene variants relevant to immune phenotypes, another phenotyping individuals with rare variants in genes implicated in T2D and obesity and a third involving an industrial partnership to aid therapeutic development for a rare autosomal recessive metabolic disorder. 24 Successful recall completion rates to these RbG studies are between 30% and 40%.

Table 3. Example of a specific disease phenotype: characteristics of ELGH volunteers with type 2 diabetes. Data are presented in summary and descriptive formats as indicated. Missing data are estimated where available, e.g. for clinical care processes and measurements, but not diagnostic coding where the absence of a code is taken to indicate the absence of a diagnosis.

What are the main strengths and weaknesses?
ELGH has multiple strengths as a large, population-based study, and its novel, pragmatic design offers opportunities to combine genomic investigation with longitudinal and cross-sectional description of health and disease as determined from EHR data. 25 ELGH reaches a British Bangladeshi and British Pakistani population with a high burden of disease, generalizable to a wider global population and building on existing genetic studies that have been criticized for focusing on White populations and substantially under-recruiting from minority ethnic groups. 26 High rates of autozygosity in ELGH volunteers lead to homozygous genotypes at variants with rare allele frequencies that will aid gene discovery, and RbG studies will generate novel translational impact. 11,24 Future studies on autozygosity will inform novel population level insights into the impact of genetic variation on health. The ability to invite all volunteers to Stage 2 studies offers the possibility to develop subcohorts and trials within cohorts in the future. Our community-based recruitment approach offers broad reach into the target population. However, to date, ELGH has modestly over-recruited British Bangladeshi versus British Pakistani volunteers. To support increased recruitment of British Pakistani volunteers, recruitment is expanding into outer London boroughs and a new Bradford Genes & Health.
The use of real-world EHR data is both a strength and weakness of ELGH. Strengths include the ability to obtain longitudinal data available on multiple diseases and disease risks via primary care, in large numbers of volunteers in a feasible and cost-effective manner. Data linkage is not yet complete, but will improve in 2019 with improved infrastructure and linkage to national registries and databases. Weaknesses are that EHR data may be inferior to observational epidemiological studies in ascertaining some phenotypes, e.g. recent diseases of minor severity (which do not necessarily require health care access) or subclinical disease. Additionally, although outcomes can be studied relatively well, EHR data have limited opportunity to study certain exposures, e.g. health behaviours, physical activity, diet and some other environmental influences.
Can I get hold of the data? Where can I find out more?
ELGH offers an open access resource to international, academic and industrial researchers to drive high-impact, world-class science. Data access is managed at several levels, as follows.  28,29 The data safe haven contains the latest genetic data linked to the questionnaire and health record phenotypes, and data export is tightly controlled. This 'bring researchers to the data' model allows us to share regular data updates, maintain complex data linkages and avoid large file data transfers. This model provides robust reassurance to volunteers that their health data will be carefully looked after, with maximum security against data breaches.
External researchers can apply to undertake research with ELGH via a formal application process (details are available on the website), and most will be required to have their own research ethics approval to work with ELGH. Applications are assessed by both the executive board and community advisory group, according to community prioritization, acceptability and scientific merit.

• ELGH prioritizes studies in areas important to, and identified by, the community it represents. Current priorities include cardiometabolic diseases (high prevalence of early onset) and mental illness. However, studies in any scientific area are possible, subject to community advisory group and ethical approval.
• ELGH combines health data science [using linked UK National Health Service (NHS) electronic health record data] with exome sequencing and SNP array genotyping to elucidate the genetic influence on health and disease, including the contribution from high rates of parental relatedness to rare genetic variation and homozygosity (autozygosity), in two understudied ethnic groups. Linkage to longitudinal health record data enables both retrospective and prospective analyses.
• Through Stage 2 studies, ELGH offers researchers the opportunity to undertake recall-by-genotype and/or recall-by-phenotype studies on volunteers. Subcohort studies, trials within cohort and other study designs are possible.

Supplementary data
Supplementary data are available at IJE online.