Data Resource Profile: Nationwide registry data for high-throughput epidemiology and machine learning (FinRegistry)

Essi Viippola, Sara Kuitunen, Rodosthenis S Rodosthenous, Andrius Vabalas, Tuomo Hartonen, Pekka Vartiainen, Joanne Demmler, Anna-Leena Vuorinen, Aoxing Liu, Aki S Havulinna, Vincent Llorens, Kira E Detrois, Feiyi Wang, Matteo Ferro, Antti Karvanen, Jakob German, Sakari Jukarainen, Javier Gracia-Tabuenca, Tero Hiekkalinna, Sami Koskelainen, Tuomo Kiiskinen, Elisa Lahtela, Susanna Lemmelä, Teemu Paajanen, Harri Siirtola, Mary Pat Reeve, Kati Kristiansson, Minna Brunfeldt, Mervi Aavikko, FinnGen, Markus Perola, Andrea Ganna*
Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland; Public Health and Welfare, Finnish Institute for Health and Welfare (THL), Helsinki, Finland; Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard, Cambridge, MA, USA; TAUCHI Research Center, Tampere University, Tampere, Finland; and Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
*Corresponding author. FIMM, University of Helsinki, PL 20 (Tukholmankatu 8), 00014 Helsinki, Finland. E-mail: andrea.ganna@helsinki.fi
A full list of members and their affiliations is available in the Supplementary Materials (available as Supplementary data at IJE online). Indicates equal contributions.


Data resource basics
Nationwide health-related registry data provide comprehensive insights into population health and, combined with other data such as demographics, familial relations and socioeconomic data, enable the exploration of various dimensions of human behaviour and health. With the increasing size and variety of available data, advanced statistical and machine learning methods present novel possibilities for prediction and causal inference. [1][2][3] At the same time, efforts to extract high-quality 'phenotypes' from registries, for example clinical endpoints, are needed to provide interpretable results. In this spirit, projects such as the CALIBER initiative have provided curated phenotype definitions based on the UK's primary and secondary health care data. 4 Traditionally, the identification of risk factors and the creation of prediction models for diseases have been conducted using a targeted approach where a specific condition, risk factor or medication is studied for its association with a single disease. Several studies [4][5][6][7][8][9][10] have shown the potential of data-driven approaches in examining the associations of a large number of risk factors and thousands of disease trajectories. These studies have been accompanied by an increasing trend towards making the results publicly available through web portals, to enable the re-use of the results by other researchers.
The FinRegistry research project [www.finregistry.fi] seeks to model the complex relationship between health and various risk factors by developing statistical and machine learning models using high-resolution longitudinal registry data. The project is a joint effort led by the Finnish Institute for Health and Welfare (THL) and the Institute for Molecular Medicine Finland (FIMM), University of Helsinki. Access to FinRegistry is granted via the Finnish Social and Health Data Permit Authority Findata, which provides a clear and transparent application process and delivers the data in a secure computing environment. No ethics approval is required; instead, Findata examines the data access requests and grants a fixed-term data permit for processing confidential materials containing personal data under the Act on the Secondary Use of Health and Social Data. 11 FinRegistry data are collected, used and stored in accordance with the General Data Protection Regulation. FinRegistry is funded by the European Research Council under the European Union's Horizon 2020 research and innovation programme.
FinRegistry data are collected across 19 registries covering the Finnish population's public health care visits, health conditions, medications, vaccinations, laboratory responses, demographics, familial relations and socioeconomic variables. As in other Nordic countries, the data are collected in nationwide electronic registries. 12 The earliest year of data collection varies by the registry, with the Finnish Cancer Registry being the oldest and dating back to 1953. Pseudonymized individual-level data from different registers can be linked together using pseudo-IDs that replace the unique personal identification number assigned to each individual residing in Finland, and familial relations allow the connection of individuals with their close relatives and their respective registry data. Furthermore, including geospatial data (geographical coordinates of the place of residence) enables the integration of open-access geographical data, such as the average environmental pollution of the area.
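The linkage described above can be illustrated with a minimal sketch. It assumes each register is a list of records keyed by a pseudo-ID; the register names, field names and records below are hypothetical and invented for illustration, and do not reflect the actual FinRegistry schema.

```python
from collections import defaultdict

# Hypothetical, made-up records; names and fields are illustrative only.
cancer_register = [
    {"pseudo_id": "P001", "year": 2004, "event": "cancer diagnosis"},
]
drug_purchases = [
    {"pseudo_id": "P001", "year": 2011, "event": "medicine purchase"},
    {"pseudo_id": "P002", "year": 2012, "event": "medicine purchase"},
]
# Hypothetical familial relations: index person -> close relatives.
relatives = {"P001": ["P002"]}


def link_registers(**registers):
    """Group records from several registers by pseudo-ID, sorted by year."""
    timeline = defaultdict(list)
    for name, records in registers.items():
        for rec in records:
            timeline[rec["pseudo_id"]].append((rec["year"], name, rec["event"]))
    return {pid: sorted(events) for pid, events in timeline.items()}


def family_records(pid, linked, relatives):
    """Return the linked register records of a person's close relatives."""
    return {rel: linked.get(rel, []) for rel in relatives.get(pid, [])}


linked = link_registers(cancer=cancer_register, drugs=drug_purchases)
print(linked["P001"])
print(family_records("P001", linked, relatives))
```

The same pseudo-ID serves as the join key for every register, which is what allows a person's records, and those of their relatives, to be assembled into a single longitudinal view.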
The study population in FinRegistry is fully representative of the Finnish population: FinRegistry covers individuals living in Finland on 1 January 2010 (FinRegistry index persons) as well as their parents, spouses, children and siblings (non-index relatives), with the exception of individuals excluded due to non-disclosure for personal safety reasons. To date, the data comprise 5 339 804 index persons and 1 826 612 non-index relatives, making up a total sample size of approximately 7.2 million individuals. The number of persons included and the years covered by each registry are presented in Figure 1, and more details are available in Supplementary Table S1 (available as Supplementary data at IJE online).
FinRegistry is a unique nationwide registry resource because of: (i) the scale and diversity of data linkage; (ii) extensive quality control, including the generation of curated health register-based clinical endpoints obtained by leveraging multiple registries and clinical expertise as part of the FinnGen project 13

Data sources
FinRegistry data can be broadly categorized into the following partially overlapping categories: (i) health care visits and health conditions; (ii) medications and vaccinations; and (iii) demographics and socioeconomics. Registers included in each category and the years and numbers of persons covered are presented in Supplementary Table S1. A publicly available data dictionary is linked on the FinRegistry website [www.finregistry.fi/finnish-registry-data] and the code for data preprocessing is available on GitHub. 14
Health care visits and health conditions comprise extensive data resulting from patient contacts with primary (since 2011 in FinRegistry), secondary (since 1969) and intensive care (since 2020), and cover details on the health care specialty. Additional information is available on psychiatric patients and those with demanding heart diseases (since 1994). Primary and home care information has been collected since 2011 and was recently extended to include data from private health care. Private service providers account for approximately a quarter of Finland's social and health services. 15 Laboratory results collected as part of public and private health care (e.g. blood glucose levels, blood cell counts and liver enzymes) are available since 2014. Disease-specific information on cancer (since 1953), microbiologically confirmed infectious diseases (since 1995), including COVID-19, and congenital malformations (since 1987) is included in separate registers. Detailed information on the mother's pregnancy, pregnancy-related risk factors, the delivery, the neonate and neonatal conditions is also collected in the Medical Birth Register (since 1987).
Medications and vaccinations cover the purchase (since 1995) of reimbursable medicines, all electronic prescriptions and their delivery records made in pharmacies (since 2010), as well as the vaccinations given in public health care (since 2011). The medication-related registries include information on pharmaceutical attributes, such as the Anatomical Therapeutic Chemical (ATC) classification code, 16 package size, formulation, dosage and cost. The underlying health indications for prescribing certain medications are reported in text format at the time of prescription, but this information is incomplete for some entries. Notably, certain health conditions for which medications are eligible for reimbursement, such as type 2 diabetes, are recorded in the Drug Reimbursements register and represent a high-quality source to identify disease diagnoses as early as 1968. Vaccination information includes, among others, the administration of COVID-19 vaccines.
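Because ATC codes are hierarchical, medication records can be aggregated at different levels of the classification. The following is a minimal sketch of splitting a code into its five standard levels (one letter, two digits, one letter, one letter, two digits). The function name is hypothetical and the sketch is not taken from the FinRegistry preprocessing code.

```python
# ATC levels end at these code lengths: 1, 3, 4, 5 and 7 characters.
LEVEL_LENGTHS = (1, 3, 4, 5, 7)
# Expected character type at each position: L = letter, D = digit.
TEMPLATE = "LDDLLDD"


def atc_levels(code):
    """Return the prefixes of an ATC code, from level 1 down to its own level."""
    code = code.strip().upper()
    if len(code) not in LEVEL_LENGTHS:
        raise ValueError(f"unexpected ATC code length: {code!r}")
    for ch, kind in zip(code, TEMPLATE):
        valid = ("A" <= ch <= "Z") if kind == "L" else ("0" <= ch <= "9")
        if not valid:
            raise ValueError(f"malformed ATC code: {code!r}")
    return [code[:n] for n in LEVEL_LENGTHS if n <= len(code)]


# A10BA02 is the ATC code for metformin, a blood glucose lowering drug.
print(atc_levels("A10BA02"))
```

Grouping purchases by a shorter prefix, for example `A10` (drugs used in diabetes), gives a coarser exposure definition than grouping by the full seven-character code.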
Demographic information and socioeconomics data include sex, date of birth and death, familial relations, marriage history and the longitudinal coordinates of the place of residence, including immigration and emigration dates, as well as information on education, employment, labour income, pensions and social assistance. A multigenerational register includes familial relations for first-degree relatives (mother, father, children and siblings) of the FinRegistry index persons. Based on first-degree relations, it is possible to construct a population-wide pedigree linking more distant relatives. 17 Living history (since 1971) is used to link open-sourced geocoded data, such as the degree of urbanicity, the number of vacant housing units and the average amount of alcohol 18
Risteys [https://risteys.finregistry.fi] is a publicly available web portal that enables exploration of clinical endpoints interactively. Risteys is the go-to place to gain insights into disease epidemiology in the Finnish population. The portal provides information on endpoint definitions and descriptive statistics in FinRegistry and FinnGen, including distributions of age and year at the first event, cumulative incidence estimates and mortality statistics. Risteys is constantly expanding to include results from high-throughput analyses performed in FinRegistry. The Risteys source code is available on GitHub 19 and more information is presented in the Supplementary Materials.
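To illustrate how more distant relatives can be derived from first-degree relations, as mentioned for the multigenerational register above, the sketch below walks mother and father links to find siblings, grandparents and first cousins. The pseudo-IDs and the child-to-parents mapping are hypothetical; this is one simple way to traverse a pedigree and not necessarily the approach used to build the FinRegistry pedigree.

```python
from collections import defaultdict

# Hypothetical pseudo-IDs: child -> (mother, father); None means unknown.
parents = {
    "C1": ("M1", "F1"),
    "C2": ("M1", "F1"),
    "G1": ("C1", "X1"),
    "G2": ("C2", "X2"),
}

# Invert the mapping once: parent -> set of children.
children = defaultdict(set)
for child, pair in parents.items():
    for parent in pair:
        if parent is not None:
            children[parent].add(child)


def known_parents(pid):
    """Return the set of recorded parents of a person."""
    return {p for p in parents.get(pid, ()) if p is not None}


def siblings(pid):
    """Persons sharing at least one parent (full or half siblings)."""
    sibs = set()
    for parent in known_parents(pid):
        sibs |= children[parent]
    sibs.discard(pid)
    return sibs


def grandparents(pid):
    """Parents of a person's parents."""
    return {g for p in known_parents(pid) for g in known_parents(p)}


def first_cousins(pid):
    """Children of the siblings of a person's parents."""
    cousins = set()
    for parent in known_parents(pid):
        for aunt_or_uncle in siblings(parent):
            cousins |= children[aunt_or_uncle]
    return cousins - siblings(pid) - {pid}


print(sorted(grandparents("G1")))
print(sorted(first_cousins("G1")))
```

Repeating the same parent and child lookups yields progressively more distant relations, which is how a population-wide pedigree can be assembled from first-degree links alone.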

Data resource use
FinRegistry is a curated, nationwide, register-based data resource for developing statistical and machine learning models, performing high-throughput epidemiological analyses and deriving outcome-specific prediction models. For example, benefiting from the rich longitudinal health and multigeneration registers, we have used FinRegistry data to comprehensively assess the role and relative importance of 414 diseases in childlessness over the entire reproductive lifespan of both men and women. 20 We have also implemented a machine learning approach to examine the association of 2890 health, socioeconomic, familial and demographic factors with the uptake of the first COVID-19 vaccination dose. 21 Last, we have implemented the Risteys portal to enable exploration of clinical endpoints and the results of high-throughput epidemiological analyses, such as mortality statistics, as described below.
High-throughput mortality analysis was applied to study the association between each clinical endpoint and death and to estimate the mortality risk associated with the clinical endpoints in the Finnish population. We used the Cox proportional hazards model 22
Figure 2. Distributions of hazard ratios for associations between each clinical endpoint and death (panels A1 and A2) and the highest hazard ratios (panels B1 and B2) for males and females. HR, hazard ratio; CI, confidence interval
Ongoing projects using FinRegistry data focus on developing both traditional prediction models and novel machine/deep learning-based approaches. For example, we are currently developing a clinical prediction model to assess an infant's risk of severe disease caused by respiratory syncytial virus, with the aim of helping the administration of novel immunoprophylaxis methods against the disease. 23 Additionally, we are exploring methods to generate latent representations, or embeddings, across all FinRegistry data. These latent representations will help reduce the data dimensionality while identifying the major axes of variation in the underlying data. We are further leveraging the extensive population-scale pedigree by using graph neural networks to improve risk prediction. Finally, we are expanding the Risteys content by including additional high-throughput epidemiological analyses, such as disease-to-disease survival analyses.
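As an illustration of the Cox proportional hazards model named in the mortality analysis above, the following minimal sketch estimates the hazard ratio of death for a single binary endpoint indicator by maximizing the partial likelihood with Newton-Raphson. The data are made up, there are no tied event times, and no adjustment covariates are used; these are simplifying assumptions of the sketch, and the actual model specification used for Risteys is not given in this section.

```python
import math

# Made-up toy cohort: follow-up time (years), death indicator
# (1 = died, 0 = censored) and endpoint indicator (1 = has the endpoint).
times = [2, 3, 4, 5, 6, 7, 8, 9]
died = [1, 1, 1, 0, 1, 1, 0, 1]
endpoint = [1, 1, 0, 1, 1, 0, 0, 0]


def score_and_information(beta, times, died, x):
    """First derivative and negative second derivative of the
    Cox partial log-likelihood for one covariate (no tied event times)."""
    score = info = 0.0
    for t_i, d_i, x_i in zip(times, died, x):
        if not d_i:
            continue  # censored subjects contribute only to risk sets
        s0 = s1 = s2 = 0.0
        for t_j, x_j in zip(times, x):
            if t_j >= t_i:  # subject j is still at risk at time t_i
                w = math.exp(beta * x_j)
                s0 += w
                s1 += w * x_j
                s2 += w * x_j * x_j
        mean = s1 / s0
        score += x_i - mean
        info += s2 / s0 - mean * mean
    return score, info


def fit_cox(times, died, x, max_iter=50, tol=1e-10):
    """Newton-Raphson estimate of the log hazard ratio and its standard error."""
    beta = 0.0
    for _ in range(max_iter):
        score, info = score_and_information(beta, times, died, x)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    _, info = score_and_information(beta, times, died, x)
    return beta, 1.0 / math.sqrt(info)


beta, se = fit_cox(times, died, endpoint)
hr = math.exp(beta)
ci = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))
print(f"HR = {hr:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

In a high-throughput setting, a fit of this kind would be repeated for each clinical endpoint, typically with an established survival analysis library rather than a hand-written optimizer.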

Strengths and weaknesses
FinRegistry data represent the entire population of Finland, allowing researchers to conduct large-scale cohort studies with a relatively low risk of selection bias and to study health events with higher statistical power. Finnish national registers include decades of data, with nearly a third of the registries covering half a century. For example, FinRegistry data have almost complete coverage of all major health-related events due to the inclusion of all treatments of severe and acute illnesses, emergency room visits, inpatient hospitalizations and major surgical operations carried out in the secondary health care of the public sector. The sociodemographic, social care and population registries similarly have virtually full coverage. The long follow-up is valuable when studying diseases with a long latent period between the exposure and the disease onset or in family-based studies requiring long follow-up periods for two or more generations. FinRegistry further combines the breadth of health data with a wide range of other information, including longitudinal data of the familial relations and the geographical coordinates of the place of residence, which in turn enables analysis of disease trajectories within families and linkage of the data to external datasets of the living environment. Finally, translating health data into clinical endpoints built on information obtained from multiple registries enhances the clinical relevance of the results and improves reproducibility across different health care systems.
As the data included in FinRegistry were not primarily collected for research, some clinically relevant variables are missing and the data do not cover all aspects of health care. Primary care-related data have been collected since 2011, with private health care being included only during recent years, and therefore the data coverage for conditions diagnosed and managed at the primary care level is limited. For instance, the Vaccination Register covers COVID-19 vaccinations well but not influenza vaccinations, as many of them are given in the private sector or occupational health care. Moreover, medications administered during hospital treatment (e.g. intravenous drugs) and non-prescription medications bought at pharmacies are not included. Primary care data include lifestyle and other modifiable risk factors, such as body mass index and smoking status, but a significant amount of data is missing. Such data are likely more often recorded if concerns are raised, resulting in lower coverage among healthy individuals. Despite the limitations above, FinRegistry provides unique, nationwide, integrated data covering various dimensions of human behaviour and health.

Data availability
See Data Resource Access, above.

Supplementary data
Supplementary data are available at IJE online.