Data Resource Profile: COVerAGE-DB: a global demographic database of COVID-19 cases and deaths

Information about pandemic dynamics is crucial to understand the potential impacts on populations, design mitigation strategies and evaluate the efficacy of their implementation. Centralization, standardization and harmonization of data are critical to enable comparisons of the demographic impact of COVID-19 which take into account differences in the age and sex compositions of confirmed infections and deaths. The international data landscape must keep pace with the global march of the pandemic, and researchers must work to triangulate the available data to create comparable measures to monitor and predict its demographic impacts.
COVerAGE-DB aims to provide global coverage of key demographic aspects of the COVID-19 pandemic as it unfolds in an up-to-date, transparent and open-access format. COVerAGE-DB offers data with standardized count measures by sex and harmonized age groups, which is a necessary but not sufficient condition to allow comparisons between populations at national and subnational scales.
The database is currently under expansion through both the increase in coverage of national and subnational populations and the inclusion of more recent periods as the pandemic continues. At the time of writing, the database contains daily counts of COVID-19 cases, deaths and tests performed, by age and sex, for 108 national and 371 subnational populations around the world, depending on the available data for each source. The date range available for each country or subpopulation varies. In several country series, the database includes the earliest confirmed cases in January 2020. For most populations, the database includes daily time series, beginning from an initial starting date when the data were first released or collected by our team. Figure 1 displays a map of countries included in the database, indicating at least one subnational population from 13 countries. A detailed overview of data availability is given in a searchable table: [https://bit.ly/3kVDrLD].

Data collected
Official counts of COVID-19 cases, deaths, and tests are extracted from reports published by official governmental institutions, such as health ministries and statistical offices. Depending on the source, data are collected in a variety of formats, including machine-readable files, pdf tables, html tables, interactive dashboards, press releases, official announcements via Twitter, and in a few instances, from digitized graphics. A full list of data sources is available in a dashboard view [https://bit.ly/2Qg1MxL].
Generally, COVID-19 cases, deaths and tests in age groups are reported as counts, but some sources report data in other metrics (fractions, percentages, ratios) or as summary indicators such as case fatality ratios (CFRs) by age. Reported age intervals vary by source, ranging from single ages to 30-year or greater age bands, and sometimes reported age intervals change over time within sources. Usually data are reported as cross-sectional snapshots of cumulative counts, but some sources give full time series of new cases or deaths, in which case we cumulate counts over time. We also collect standard metadata on each of the sources to capture various characteristics of the collected data, such as the primary collection channels, definitions used and notes on major disruptions or events. An overview of key fields from these metadata is shared as a spreadsheet [https://bit.ly/2FAmKFn].
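The cumulation step described above, for sources that publish new rather than cumulative counts, can be sketched as follows. This is a minimal illustration in Python, not the project's R code; the function name and the sample values are hypothetical.

```python
from itertools import accumulate


def cumulate(new_counts):
    """Convert a chronologically ordered series of new counts
    into a series of cumulative counts."""
    return list(accumulate(new_counts))


# Hypothetical daily new deaths reported for one age group
new_deaths = [0, 2, 1, 0, 4]
cumulative_deaths = cumulate(new_deaths)
print(cumulative_deaths)  # [0, 2, 3, 3, 7]
```

The sketch assumes the series is complete and sorted by date; a missing day would need to be handled before cumulating.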

Data production
All source data are entered into standard spreadsheet templates hosted in a central folder on Google Drive. Data entry into the templates is either manual or automatic, depending on the source. R programs collect data from the source templates and compile the merged input database. The merged input file is then subject to a series of automatic validity checks. Initial checks are carried out by the individual responsible for data collection and entry, using an interactive application [https://mpidr.shinyapps.io/cleaning_tracker/]. Data are then harmonized to standard metrics (counts), measures (cases, deaths, tests) and age bands (5- and 10-year age intervals). Harmonization procedures include rescaling to ensure coherence between age distributions and reported total counts. Age group harmonization is done using the penalized composite link model for ungrouping, 1 which was designed for splitting histograms of count data. Output data also include a file containing selected diagnostics of data quality, such as completeness of age reporting, for each source and date.
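Two of the harmonization steps named above, converting reported fractions to counts and rescaling an age distribution to a reported total, can be sketched as below. This is a simplified Python illustration that assumes plain proportional rescaling; the function names and numbers are hypothetical, the actual procedures are those of the Method Protocol and the R scripts, and the penalized composite link model used for age ungrouping is not shown.

```python
def fractions_to_counts(fractions, total):
    """Convert an age distribution given as fractions (or percentages)
    into counts that sum to the reported total."""
    s = sum(fractions)
    return [total * f / s for f in fractions]


def rescale_to_total(age_counts, reported_total):
    """Proportionally rescale age-specific counts so that they sum
    to the separately reported total (assumed rule, for illustration)."""
    s = sum(age_counts)
    if s == 0:
        return list(age_counts)  # nothing to distribute
    factor = reported_total / s
    return [c * factor for c in age_counts]


# Hypothetical source reporting percentages by age and a total of 200 deaths
counts_from_pct = fractions_to_counts([10, 30, 60], 200)

# Hypothetical source where counts with known age sum to 95,
# but the reported total is 100
rescaled = rescale_to_total([5, 20, 70], 100)
print(counts_from_pct, rescaled)
```

Proportional rescaling keeps the shape of the reported age distribution and implicitly assumes that cases of unknown age are distributed like those of known age.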
The complete details on all steps of production are available in the COVerAGE-DB Method Protocol, which is publicly available on the web. 2 A table listing which adjustments are applied to each population is available on the project website [https://bit.ly/2E61BSV]. The merged input database, the harmonized output and the data quality files are uploaded daily as zipped csv files to an Open Science Framework repository (OSF) [https://osf.io/mpwjq/]. A GitHub repository [https://bit.ly/2YbtPCJ], which is linked to OSF, contains all R scripts used in the complete production pipeline, including compilation, diagnostics and harmonization.

Data resource use
Since collection efforts began for COVerAGE-DB in late March 2020, we have become aware of 15 studies using the data, many of which provide R code online and are fully reproducible. Broadly, these studies aim to measure the influence of demographic factors on mortality from COVID-19, 3,4 assess the pandemic impact on health and mortality within 5,6 and across populations, 7-12 analyse COVID-19 data availability and quality, 13 propose methodological innovations that allow comparisons of CFRs 14 and develop indirect methods to estimate infections in the population. 15,16 The database is also used to monitor COVID-19 impacts in particular age ranges. For instance, UNICEF has used the database for monitoring the burden of the pandemic on children around the world 17 and the UN Department of Economic and Social Affairs has used it similarly to focus on older age groups. 18 As an example of the analyses that COVerAGE-DB enables, Figure 2 displays changes in the relation between age-specific death and case rates in Colombia, inspired by Figure 1 of Dudel et al. 14 We divide both cases and deaths in each age band by the respective population sizes. Diagonal lines indicate age-specific CFRs. The graph illustrates a sharp increase in CFR over age for each sex, and displays considerable sex differences. For instance, men aged 60-69 in Colombia have almost the same CFR (approximately 12% risk of death after COVID-19 disease diagnosis) as women aged 70-79.
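The quantities plotted in this kind of figure can be computed directly from counts and population sizes, as the Python sketch below shows. The input numbers are hypothetical and are not taken from COVerAGE-DB; they are chosen only so that the CFR equals 12%. Because the CFR is the ratio of the death rate to the case rate, points sharing a CFR fall on a common diagonal when both axes are logarithmic.

```python
def age_specific_measures(cases, deaths, population):
    """Return (case rate, death rate, CFR) for one age band and sex."""
    case_rate = cases / population    # confirmed cases per person
    death_rate = deaths / population  # deaths per person
    cfr = deaths / cases              # deaths per confirmed case
    return case_rate, death_rate, cfr


# Hypothetical counts for a single age band
case_rate, death_rate, cfr = age_specific_measures(
    cases=5_000, deaths=600, population=1_000_000
)
print(f"case rate {case_rate:.4f}, death rate {death_rate:.4f}, CFR {cfr:.2%}")
```

The population size cancels out of the CFR, so it affects only the position of a point along its diagonal, not which diagonal it lies on.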
We repeat this exercise to compare Colombia with Mexico (see Figure 3), where standardizing by population size is more justified. CFRs and death rates are much higher in Mexico than in Colombia in each age band (around 2-fold), except for ages 80+, which show a substantial reduction in the CFR difference and much higher death rates for Colombia.
This comparison between Colombia and Mexico allows us to illustrate several issues in data quality to be considered when comparing COVID-19 outcomes between populations in general. Besides the economic and sanitary conditions that make Latin American countries more vulnerable to the pandemic, the lack of unambiguous definitions of COVID-19 cases and deaths and the limited testing capacity represent major challenges for data quality assessment. 19-21 We focus here on definitions and testing strategies.
With respect to COVID-19 case and death definitions, criteria have varied since records started. At the time of data retrieval, both countries use laboratory, clinical and epidemiological criteria to confirm SARS-CoV-2 infections. 22,23 However, the vast majority of COVID-19 cases and deaths are confirmed with RT-PCR test results in both populations (99.6% and 91.6% in Colombia and Mexico, respectively). 24,25 Regarding the definition of tests, in Colombia the count refers to laboratory samples tested (4.5 M as of 7 November 2020), whereas in Mexico it refers to persons tested (2.3 M). Because individuals may be tested more than once, comparison between these two units is not straightforward. Testing performance measures, such as positive rates (e.g. 30% in Colombia and 45% in Mexico 26 ), are essential for interpreting differences in cases and deaths across populations, because they help to assess the extent of infection under-reporting. 27 However, differences in test definitions pose serious challenges for direct comparisons. Dates in both sources are comparable, corresponding to the occurrence of events. Since information from both sources relies on individual-level databases, delays in diagnosis and death registration are retrospectively adjusted.
Differences in testing capacity and strategy between countries are also key determinants for infection diagnosis. Given both the magnitude of contagion and limited resources in the region, Latin American countries have struggled to increase testing capacity proportionally to the spread of the infection. 28,29 Despite very limited capacity, the testing approach of Colombia has been to test as many suspected cases as possible. In contrast, an important part of the test strategy in Mexico has focused on inferring the extent of contagion in the population by using nationally representative samples (known as Centinela, which represent 36.5% of all confirmed infections at the date under observation), and it has gradually included a small proportion of suspected infections outside the Centinela system. 23 On 7 November 2020, Colombia performed five times more tests per capita than Mexico. These differences in testing regimes between the two countries may account for a substantial part of the CFR discrepancies observed in Figure 3.
The differences in definitions and testing strategies between populations highlight challenges in making comparisons and also the need to produce data with sufficient detail to adjust for biases. For this reason, alongside data on cases, deaths and tests, COVerAGE-DB offers additional information on metadata and quality metrics that are needed for a cautious interpretation of the data and their limitations. It is our view that researchers should triangulate creatively from all available data rather than avoid difficult comparisons.

Strengths and weaknesses
Since the beginning of the pandemic, it has been evident that population characteristics are key to understanding the prevalence, spread and fatality of COVID-19 across countries. However, data on cases, deaths and tests disaggregated by age and sex are not easily comparable across countries, and sometimes not even accessible. The main strength of COVerAGE-DB is to provide a centralized, open-access and fully reproducible repository of age- and sex-specific case, death and test counts from COVID-19, collected from official sources and harmonized to standard output formats. The data harmonization process is transparent, following a strict protocol. 2 The initial input data are provided alongside the harmonized counts, as well as the code used to harmonize the different input measures, metrics and age groups into comparable granular output metrics. All scripts are written in the open-source R programming language. 30 The data sources and limitations are documented for each country in a standard metadata framework.
A limitation of the COVerAGE-DB is the heterogeneous and difficult-to-evaluate quality of the underlying data. No single data source can currently claim accurate estimates of COVID-19 incidence or fatalities. Age-specific case counts are highly dependent upon the testing capacity, 31 testing strategy 32 and differences in the definition of cases across sources and over time. Recorded cases underestimate infections everywhere, with underestimation expected to vary by age, given the relationship between age and case severity. 33 The accuracy of diagnostic RT-PCR tests used to confirm infections is also known to vary. 34 Furthermore, at any given date, cumulative counts are underestimated because of the lag between infection and a positive test result. 35 Death counts from COVID-19 are also likely underestimated for similar reasons and also due to various kinds of delays in death registration. Media reports have circulated about intentional data manipulation in some of the official data covered in the database. 36 Excess all-cause mortality has been observed across many regions. 37-40 Although some of these deaths likely result from postponed or forgone treatment for non-COVID-19-related causes, the magnitude of this excess suggests that numerous COVID-19-related deaths are classified under different causes. Populations also differ in whether deaths of suspected COVID-19 cases are included in official statistics and in post-mortem practices when an infection is suspected. 41 Some populations only report deaths occurring in hospitals, neglecting a potentially sizeable proportion of deaths occurring in institutional settings and at home. 42 Most populations currently report all deaths among confirmed SARS-CoV-2 infections as COVID-19 deaths for this database, but the underlying cause of death eventually reported on the death certificate may differ in patients with severe comorbidities.
To mitigate biases and misinterpretations due to different practices and definitions, such information is constantly updated and documented in the metadata of the database, which are freely accessible to users. Further, a supplementary data quality metrics file contains a suite of data quality indicators that is easily merged with the main output data. Quality metrics include age-reporting completeness, some indicators on how aggressive age harmonization is, and two positivity measures from the Our World in Data database on COVID-19 testing. 26 All of these issues compromise the comparability of the data contained within the COVerAGE-DB, both across populations at any given time and within populations over time. That is, the database enables direct calculation of age-specific CFRs, but one must be careful when making comparisons. Care must also be taken not to interpret calculated CFRs as infection fatality ratios, the latter of which include both detected and undetected SARS-CoV-2 infections in the denominator. Proper estimation of incidence and fatality, and of total demographic impacts, will likely require triangulating data across numerous sources as these become available. To this end, the COVerAGE-DB was designed to be easily merged with other databases such as the Our World In Data testing or excess mortality data, 26 the COVID-19 dashboard of Johns Hopkins, 43 the World Population Prospects database 44 and the Short Term Mortality Fluctuations database. 40 Moreover, given that we have near-complete time series capturing the whole pandemic curve in some places, careful modelling of lag structures might allow some of these data-driven biases to be estimated.
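The distinction drawn above between CFRs and infection fatality ratios (IFRs) can be made concrete with a small Python sketch. It rests on strong assumptions that are ours, not the database's: that all COVID-19 deaths are counted, and that the share of infections detected is known. The function name and the values are hypothetical.

```python
def ifr_from_cfr(cfr, detected_share):
    """Infection fatality ratio implied by a CFR when only a share of
    all infections is detected, assuming every death is counted.

    IFR = deaths / all infections
        = (deaths / confirmed cases) * (confirmed cases / all infections)
    """
    if not 0 < detected_share <= 1:
        raise ValueError("detected_share must be in (0, 1]")
    return cfr * detected_share


# Hypothetical: CFR of 12% when one in four infections is confirmed
ifr = ifr_from_cfr(cfr=0.12, detected_share=0.25)
print(f"implied IFR: {ifr:.2%}")
```

Under these assumptions the IFR can never exceed the CFR, and the gap widens as detection worsens. Since detection is expected to vary by age, the adjustment would in practice need to be made separately for each age group.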

Data resource access
Both merged input and harmonized output files can be downloaded directly from the OSF site [https://osf.io/mpwjq; doi: 10.17605/OSF.IO/MPWJQ], which contains a folder called 'Data' with four files of primary data. Figure 4 shows where to find the files in the OSF repository.
Each of the main data files has a stable link (see Table 1) which always points to the most recent version. Each file is a zipped csv file by the same name. For stable links to download particular versions, click on the version number in the Version column seen in Figure 4. Users can note versions either by referring to timestamps provided in the headers of data files or by referring to OSF file version numbers, which increment with each daily update.
A data dictionary is given in both the OSF wiki [https://osf.io/mpwjq/wiki/home/] and the Method Protocol. 2 Files are shared in csv format to be as universally accessible as possible. A guide to getting started using the data in R, including how to merge COVerAGE-DB with other databases, is also provided [https://bit.ly/3g8nIVU], and tips for other statistical packages may also be added. Users are encouraged to reach out for further information or advice on using the database, or to express interest in the project at: [coverage-db@demogr.mpg.de].

• The database is in continuous development. It includes data since January 2020, and as of 7 January 2021, it includes 108 countries and 371 subnational areas.
• The database also documents variations in definitions of all input data and indicators of reporting completeness across sources and over time.