Cohort Profile: Centro de Integração de Dados e Conhecimentos para Saúde (CIDACS) Birth Cohort

Enny S Paixao,* Luciana L Cardim, Ila Rocha Falcao, Naiá Ortelan, Natanael de Jesus Silva, Aline dos Santos Rocha, Samila Sena, Daniela Almeida, Dandara Oliveira Ramos, Flávia Jôse Oliveira Alves, Nívea Bispo, Sanni Ali, Rosemeire Fiaccone, Moreno Rodrigues, Liam Smeeth, Elizabeth B Brickley, Liliana Cabral, Carlos Teles, Maria Conceição N Costa, Maria Yury Ichihara, Mauricio L Barreto, Rita de Cássia Ribeiro Silva and Maria Gloria Teixeira
Centro de Integração de Dados e Conhecimentos para Saúde, Fiocruz, Salvador, Bahia, Brazil; Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK; Escola de Nutrição, Universidade Federal da Bahia, Salvador, Brazil; Instituto de Saúde Coletiva, Universidade Federal da Bahia, Salvador, Bahia, Brazil; and Departamento de Estatística, Universidade Federal da Bahia, Salvador, Bahia, Brazil

dynamic Brazilian birth cohort. The use of large, routinely collected, high-quality social and health databases provides a unique opportunity to examine factors that might result in long-term and rare child and mother outcomes over time without the limitations of a traditional cohort, such as limited sample sizes and expensive resources.
The CIDACS Birth Cohort is housed at the Centre for Data and Knowledge Integration for Health (CIDACS), a unit of the Oswaldo Cruz Foundation in Bahia, Brazil. CIDACS also houses the 100 Million Brazilian Cohort. CIDACS works across the full spectrum of data acquisition, linkage of large Brazilian national databases, management, analysis and interpretation, with attention to ethical use and privacy issues. 16 Ethical approval was obtained from the Federal University of Bahia's Institute of Public Health Ethics Committee (CAAE registration number: 18022319.4.0000.5030).

Who is in the cohort?
Brazil has about 3 million births a year. A total of 44 485 267 births were recorded in the live birth system (SINASC) over 2001-15. The CIDACS Birth Cohort population is composed of 24 695 617 (55%) children born alive in Brazil between 1 January 2001 and 31 December 2015 who were linked to the baseline of the 100 Million Brazilian Cohort through maternal information common to the two datasets. All children with information recorded in SINASC were eligible for linkage.
SINASC records live births in Brazil using a standardized form completed by a health professional who assisted the child's delivery. This form has information on the pregnancy, the delivery and the newborn, including congenital anomalies, birthweight and sex. An evaluation of the birth registration system in Brazil found that over 97% of Brazilian live births are registered in this system. 17,18 The baseline of the 100 Million Brazilian Cohort was created using administrative records from over 114 million individuals aged 16 years or older, whose families applied for social assistance via the Unified Register for Social Programmes (Cadastro Único para Programas Sociais: CadUnico). Since 2003, CadUnico has been the main instrument used by the Brazilian government to assess the inclusion criteria of potential beneficiaries of social programmes. To be enrolled in CadUnico, one person in the family must provide information and the required documents for all family members to an interviewer. This person must be at least 16 years old and, preferably, be a woman. The information is renewed periodically as long as the person is a candidate for, or enrolled in, any one of the Brazilian benefits, such as Bolsa Familia (cash transfer for low-income families), Minha Casa Minha Vida and Beneficio de Prestação Continuada (continuous benefit for people with long-term disability), among others. 19 By the end of 2015, 40 542 929 families (comprising 114 001 661 individuals) had registered in CadUnico.
The characteristics of mothers and children in the CIDACS Birth Cohort were compared with those of the non-linked population of mothers and children registered in SINASC, to assess differences and similarities between the two populations (Table 1). A higher proportion of mothers of children in the CIDACS Birth Cohort are younger, i.e. less than 20 years old (25% vs 15%), and unmarried (58% vs 43%) than in the non-linked population recorded in SINASC. The proportion of mothers with 8 years or more of schooling was higher in the non-linked population (69%) than among those included in the cohort (52%). Children included in the cohort were more likely to be born via vaginal delivery (60%) than the non-linked Brazilian births (42%). Children from minority ethnic backgrounds were included in the cohort. To date, the cohort includes 83 413 Indigenous children and 37 441 children born in Quilombo communities descended from African Brazilian fugitive slaves.

The linkage processes
We linked SINASC live birth records with the baseline of the 100 Million Brazilian Cohort using the name of the mother, maternal age at birth, maternal date of birth and the municipality of residence of the mother at the time of delivery. We excluded duplicates and records with missing or implausible names. The linkage was performed with CIDACS-RL (Centre for Data and Knowledge Integration for Health-Record Linkage), 20 a novel record linkage tool developed at CIDACS to link big administrative datasets. The linkage is described in detail in Almeida et al. (2020). 21 At CIDACS, the processing and linking of identified databases follow legal frameworks related to ethics, privacy and data security. The study protocol was reviewed and approved by the Federal University of Bahia's Institute of Public Health Ethics Committee (CAAE registration number: 18022319.4.0000.5030).
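The general idea of linking births to mothers on shared maternal attributes can be illustrated with a minimal sketch. This is not the CIDACS-RL algorithm, which is described in the cited references. It assumes, purely for illustration, that candidates are restricted to records with the same municipality and maternal date of birth and that names are compared with Python's standard `difflib` similarity. All field names, the 0.9 threshold and the records themselves are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names, ignoring case and extra spaces."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def link_births(births, baseline, threshold=0.9):
    """Link each birth record to its best-matching baseline record, if any.

    Assumption: candidates are blocked on municipality and maternal date of
    birth, then the mother's name is compared approximately.
    """
    # Index baseline records by the (hypothetical) blocking key.
    blocks = {}
    for rec in baseline:
        key = (rec["municipality"], rec["dob"])
        blocks.setdefault(key, []).append(rec)

    links = []
    for birth in births:
        key = (birth["municipality"], birth["mother_dob"])
        best, best_score = None, 0.0
        for cand in blocks.get(key, []):
            score = name_similarity(birth["mother_name"], cand["name"])
            if score > best_score:
                best, best_score = cand, score
        # Keep the pair only if the best candidate is similar enough.
        if best is not None and best_score >= threshold:
            links.append((birth["birth_id"], best["person_id"], best_score))
    return links


# Made-up records, for illustration only.
births = [
    {"birth_id": "B1", "mother_name": "Maria da Silva Santos",
     "mother_dob": "1985-03-12", "municipality": "M001"},
    {"birth_id": "B2", "mother_name": "Ana Paula Oliveira",
     "mother_dob": "1990-07-01", "municipality": "M001"},
]
baseline = [
    {"person_id": "P1", "name": "MARIA DA SILVA SANTOS ",
     "dob": "1985-03-12", "municipality": "M001"},
    {"person_id": "P2", "name": "Joana Pereira Lima",
     "dob": "1990-07-01", "municipality": "M001"},
]
result = link_births(births, baseline)
```

In this toy example only birth `B1` is linked, because the second mother's name differs too much from the only candidate in her block. A stricter threshold lowers the risk of false links at the cost of missing true ones.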
How often have they been followed up?
The individuals included in the CIDACS Birth Cohort will be dynamically followed from birth to death. Brazil has several mandatory national health and social registries that allow us to track a range of events throughout the individual's life, including hospitalizations, infectious disease occurrence, nutritional status, enrolment in social protection programmes and death (Figure 1). The follow-up will proceed using two linkage strategies: (i) deterministic linkage through unique national identification numbers that allow the cohort participants to be linked to periodically renewed socioeconomic information in CadUnico datasets (by CadUnico regulation, as long as the person is a candidate to receive or a recipient of one of the several Brazilian government social protection programmes, they have to update the information every 2 years); (ii) non-deterministic linkage of the baseline of the CIDACS Birth Cohort with health administrative datasets. The linkage to update the information will be done every 2 years. See Table 2.
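Strategy (i) amounts to an exact join on a shared identifier. The sketch below shows one way such a deterministic linkage could attach the most recent socioeconomic update to each cohort member. The field names (`national_id`, `updated_on`, `income`) and the records are hypothetical and do not reflect the actual CadUnico layout.

```python
def latest_updates(updates):
    """Keep, for each identifier, the most recent socioeconomic update."""
    latest = {}
    for upd in updates:
        current = latest.get(upd["national_id"])
        # ISO-formatted dates (YYYY-MM-DD) compare correctly as strings.
        if current is None or upd["updated_on"] > current["updated_on"]:
            latest[upd["national_id"]] = upd
    return latest


def attach_follow_up(cohort, updates):
    """Exact (deterministic) join of cohort members to their latest update."""
    latest = latest_updates(updates)
    return [
        {**member, "follow_up": latest.get(member["national_id"])}
        for member in cohort
    ]


# Made-up records, for illustration only.
cohort = [
    {"child_id": "C1", "national_id": "111"},
    {"child_id": "C2", "national_id": "222"},
]
updates = [
    {"national_id": "111", "updated_on": "2011-05-10", "income": 300},
    {"national_id": "111", "updated_on": "2013-06-02", "income": 450},
]
followed = attach_follow_up(cohort, updates)
```

Members without any update keep `follow_up` set to `None`, which makes it explicit who has not been observed again rather than silently dropping them.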
The Information System for Notifiable Diseases (SINAN) is the compulsory notification system for a list of infectious diseases, including dengue, zika, tuberculosis and chikungunya. 22 Suspected and/or confirmed cases must be reported to the Epidemiological Surveillance Centre on a specific numbered notification form, which is available in any local health facility. The form collects information on the date of notification, date of onset of symptoms, date of birth, name of the patient, age, sex and address. The Epidemiological Surveillance Centre then investigates to confirm or discard the suspicion based on the Brazilian case definition, which is specific for each disease. The quality of the data and the years available vary according to the notified disease. 23
The Food and Nutrition Surveillance System (SISVAN) will be used to assess child and maternal nutrition. Data from this system are available over 2008-15 and include information on anthropometric measurements, including weight and height, food consumption, breastfeeding and complementary feeding practices. The national population coverage of SISVAN ranges between 10% and 15%, mainly among children and adolescents. For those registered in the cash transfer programme Bolsa Familia, who are also enrolled in CadUnico, the SISVAN coverage varies from 57% to 86%. 24
All hospital admissions financed by the Brazilian National Health System (about 75% of all hospitalizations in Brazil) are recorded in the Information System of Hospitalizations (SIH). The hospitalization system includes personal patient information, date of hospitalization, duration, type of hospital, costs incurred and causes of hospitalization. 25
The Information System of Mortality (SIM) uses the death certificate, a legal document. This form collects information on the deceased individual and the conditions, place and cause of death. In the case of fetal deaths or infant mortality, it also includes maternal characteristics. In 2015, it was estimated that SIM registered more than 97% of Brazilian deaths. 26

What has been measured?
The CIDACS Birth Cohort includes basic information on the mother (name, place of residence, age, marital status, education) and her obstetric history [whether she had a stillbirth or miscarriage, whether she had a previous caesarean section (CS) or vaginal delivery], the pregnancy (length of gestation, type of delivery, fetal presentation), the newborn (birthweight, presence of congenital anomalies) and the antenatal care (number of visits and when care started). In addition to birth and maternal information obtained from SINASC, socioeconomic and demographic data from the 100 Million Brazilian Cohort, such as information on family dynamics, child care arrangements, parental employment, income, housing, family formation and dissolution, are available in the baseline of the CIDACS Birth Cohort. Information on growth, breastfeeding and infectious disease has been included in the cohort follow-up. Although most variables have less than 10% missing data, there is a substantial proportion of missingness for the variables on the mother's history of stillbirth or miscarriage (16% missing) and the employment situation of the household head (54% missing).

What has it found?
To date, the CIDACS Birth Cohort has been used to analyse birth and mortality outcomes. Preterm births (<37 weeks of gestational age), low birthweight (<2500 g) and congenital anomalies were observed in 8.1%, 8.3% and 0.7% of the total births included in the cohort, respectively (Table 1). Deaths occurred from the first hours of life to the age of 14 years, and more than 80% of the deaths in our cohort occurred before the first year of life, mainly during the neonatal period (less than 28 days old).
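The outcome definitions quoted above translate directly into simple classification rules. The sketch below applies the stated cut-offs (<37 weeks, <2500 g, <28 days). The function names and period labels are illustrative, treating the first year as 365 days is a simplifying assumption, and missing values are propagated as `None` rather than counted as normal.

```python
def below(value, cutoff):
    """Return None when the value is missing, else whether it is below the cutoff."""
    return None if value is None else value < cutoff


def classify_birth(gestational_weeks, birthweight_g):
    """Flag preterm birth (<37 weeks) and low birthweight (<2500 g)."""
    return {
        "preterm": below(gestational_weeks, 37),
        "low_birthweight": below(birthweight_g, 2500),
    }


def death_period(age_days):
    """Label the period of death from the age at death in days."""
    if age_days < 28:
        return "neonatal"
    if age_days < 365:  # assumption: first year approximated as 365 days
        return "post-neonatal"
    return "after first year"


example = classify_birth(gestational_weeks=36, birthweight_g=2400)
missing = classify_birth(gestational_weeks=None, birthweight_g=3200)
```

Keeping missing gestational age as `None` matters in administrative data, where treating an unrecorded value as "not preterm" would bias prevalence estimates downwards.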
Further linkages between the CIDACS Birth Cohort baseline and other follow-up datasets are ongoing. The linkage with SINAN is being carried out to evaluate the impact of maternal infections, including zika and syphilis, on early outcomes (prematurity, low birthweight, congenital anomalies) and late outcomes (hospitalization and mortality). In addition, the linkage with SISVAN is being conducted to analyse child growth curves and the effect of maternal nutrition on birth and child growth outcomes.

What are the main strengths and weaknesses?
The CIDACS Birth Cohort has several strengths. First, it links health and social data coming from various government sectors, adding enormous value to already existing health data in determining both the drivers of health and the consequences of ill health. Second, its longitudinal structure makes it possible to: (i) add new exposures or outcomes over time; and (ii) study outcomes at different times of exposure, including long-term outcomes. Third, the large sample sizes allow analysis of small groups and rare events in ways that are not possible in projects that are dependent on the primary collection of new data. Fourth, we have included in our data information on isolated populations, such as Indigenous people and Quilombo communities descended from African Brazilian fugitive slaves. Fifth, the use of administrative data eliminates the risk of recall bias, which is a problem if data collection relies on self-reports of service use (e.g. hospitalization or birth). Sixth, the linkage has been conducted with robust and accurate software developed in-house (CIDACS-RL), and a specialized team evaluates each linkage performed at CIDACS.
There are some limitations that must be considered when analysing the CIDACS Birth Cohort. Measuring follow-up can be a complex task in large, linked datasets where individuals have complex histories and errors can be present. The cohort baseline is the linked population of SINASC and the baseline of the 100 Million Brazilian Cohort, both of which are routinely collected data that were not designed for research purposes. Therefore, it brings well-known limitations relating to missing data, underestimation and potential misclassification. For example, in SINASC, the proportion of preterm births recorded was found to be underestimated by 15%, and misclassification, based on the criterion used to assess gestational age at birth (date of the last period), could have occurred. 27 However, these errors probably affected the entire dataset. We have a considerable proportion of missing values in variables that are not mandatory in CadUnico, such as the occupation of the household members (54%). Nevertheless, the description of all individuals in the household (e.g. sex, age, education and ethnicity) and key variables that are used as eligibility criteria for social programmes, such as income, have good completeness.
A limitation that must be considered for each specific research question is that people enrolled in CadUnico represent the poorest half of the Brazilian population. There is a socioeconomic gradient that influences prevalence estimates, as reflected in the higher rates of vaginal delivery, which, in Brazil, is less common among wealthy families (Table 1). However, the CIDACS Birth Cohort aims to provide valid estimates of associations between putative causal factors and disease. Although the prevalence of both exposures and diseases may differ from that found in the general population, the estimates of association can still be valid. Several validity studies will be performed to address this question.
The linkage process posed several challenges. Linking records of different individuals (mother and baby) relies on a limited number of identifiers, which tends to yield higher rates of linkage error, commonly due to inaccurate or incomplete identifying data. The most critical barrier to linking maternal and live birth records is the limited availability of common and complete personal identifiers, which directly lowers sensitivity. A validation study estimated that the overall proportion of linked people between 2001 and 2015 was 59%. However, this was not constant over the years, and from 2012 the sensitivity reached about 80%, similar to values from studies conducted in Georgia and New Jersey, USA. 28,29

Can I get hold of the data? Where can I find out more?
Data that support the information presented are available upon request from CIDACS, subject to ethical approval. The data are not publicly available due to restrictions, as they contain information that could compromise the privacy of the research population.
Currently, only national and international researchers who collaborate with CIDACS, and authorized staff from government agencies, can have controlled access to de-identified linked data. These individuals and organizations must be committed to advancing scientific knowledge or generating evidence for public policy formulation. Researchers can access relevant de-identified data for their proposed study objectives exclusively via secure remote access to virtual machines.
Persons who wish to receive authorization must: (i) be affiliated to the institution or be identified as collaborators; (ii) present a detailed research project together with ethical approval by an appropriate Brazilian institution; (iii) provide a clear data plan restricted to the objectives of the proposed study and a summary of the analysis plan intended to guide the linkage and/or data extraction of the relevant set of records and variables; (iv) sign terms of responsibility regarding the access and use of data; and (v) perform the analyses of datasets provided using the CIDACS data environment, a safe and secure infrastructure that provides remote access to de-identified datasets and analysis tools. For more information, please visit the CIDACS website [https://cidacs.bahia.fiocruz.br/] or contact us via email [cidacs@bahia.fiocruz.br].