Future-proofing and maximizing the utility of metadata: The PHA4GE SARS-CoV-2 contextual data specification package

Abstract
Background
The Public Health Alliance for Genomic Epidemiology (PHA4GE) (https://pha4ge.org) is a global coalition that is actively working to establish consensus standards, document and share best practices, improve the availability of critical bioinformatics tools and resources, and advocate for greater openness, interoperability, accessibility, and reproducibility in public health microbial bioinformatics. In the face of the current pandemic, PHA4GE has identified a need for a fit-for-purpose, open-source SARS-CoV-2 contextual data standard.
Results
As such, we have developed a SARS-CoV-2 contextual data specification package based on harmonizable, publicly available community standards. The specification can be implemented via a collection template, as well as an array of protocols and tools to support both the harmonization and submission of sequence data and contextual information to public biorepositories.
Conclusions
Well-structured, rich contextual data add value, promote reuse, and enable aggregation and integration of disparate datasets. Adoption of the proposed standard and practices will better enable interoperability between datasets and systems, improve the consistency and utility of generated data, and ultimately facilitate novel insights and discoveries in SARS-CoV-2 and COVID-19. The package is now supported by the NCBI's BioSample database.


Findings
The importance of contextual data for interpreting SARS-CoV-2 sequences
First identified in late 2019 in Wuhan, China, the SARS-CoV-2 virus has now spread to virtually every country and territory in the world, resulting in millions of confirmed cases and deaths globally [1,2]. Understanding, monitoring, and preventing transmission, as well as the development of vaccines and effective therapeutic options, have been primary goals of the public health response to SARS-CoV-2.
Tracking the spread and evolution of the virus at global, national, and local scales has been aided by the analysis of viral genome sequence data alongside SARS-CoV-2 epidemiology. Large-scale sequencing efforts are often formalized as consortia across the world, including the COG-UK in the UK [3], SPHERES in the USA [4], CanCOGeN in Canada [5], the Latin American Genomics SARS-CoV-2 Network [6,7], 2019nCoVR in China [8], the South Africa NGS Genomic Surveillance Network [9], AusTrakka in Australia and New Zealand [10], and INSACOG in India [11]. In addition to these initiatives, many agencies, universities, and hospital laboratories around the world are also sequencing and sharing sequence data at an unprecedented pace. Deposition of these sequences into public repositories such as the Global Initiative on Sharing All Influenza Data (GISAID) and the International Nucleotide Sequence Database Collaboration (INSDC) has enabled rapid global sharing of data [12,13]. At the time of writing, 174 countries had undertaken open sequencing initiatives (GISAID accessed 2021-06-23) depositing 2,057,675 sequences, which are being reused and analysed on a massive scale. The open data sharing paradigm has had tremendous success in the genomic epidemiology of foodborne pathogens [14,15] and has the potential to reveal a deeper understanding of SARS-CoV-2 origin, pathogenicity, and basic biological characteristics when submissions from environmental samples and wild hosts are included alongside human clinical samples [16].
SARS-CoV-2 sequencing, analysis, and open sharing have played a crucial role in a number of developments during the pandemic, such as dispelling misinformation about the origins of the virus [17], the identification and surveillance of variants of concern [18,19], the improvement of diagnostic performance and rapid testing [20][21][22], and the development of vaccines, which are currently being distributed in the largest global vaccination program the world has ever seen [23]. Viral genomic sequences are also being used to understand transmission and reinfection events [24], as well as to monitor the prevalence and diversity of lineages during different exposure events and in different settings, e.g., animal reservoirs [25], long-term care facilities [26][27][28], healthcare and other work sites [29][30][31][32][33], and conferences and other public gatherings [34], as well as before and after public health responses (e.g., border controls and travel restrictions, lockdowns and quarantines, vaccination), through successive waves of infections [35][36][37][38][39][40][41][42][43][44][45][46]. However, it is critical to note that public health sequence data are of limited value without accompanying contextual metadata.
Contextual data consist of sample metadata (e.g., collection date, sample type, geographical location of sample collection), as well as laboratory (e.g., date and location of testing, cycle threshold [CT] values), clinical outcomes (e.g., hospitalization, death, recovery), epidemiological (e.g., age, sex, exposures, vaccination status), and methods (e.g., sampling, sequencing, bioinformatics) data that enable the interpretation of sequence data. High-quality contextual data are also crucial for quality control. For example, detecting systematic batch effect errors related to certain sequencing centres and methods can help evaluate which variants represent real, circulating viruses, as opposed to artefacts of sample handling or sequencing that may arise owing to different aspects of experimental design, laboratory procedures, bioinformatics processing, and applied quality control thresholds [47][48][49].
Good data stewardship practices are critical not only for auditability and reproducibility but also for posterity: documenting critical information about samples, methods, risk factors, outcomes, and so forth can help future-proof information used to build a roadmap for dealing with future public health crises. Contextual data, however, are often collected on a project-specific basis according to local needs and reporting requirements, which results in the collection of different data types at different levels of granularity, with different meanings and implicit bias of variables and attributes. Furthermore, the information is often collected as free text or, if structured, according to organization or initiative-specific data dictionaries, using different fields, terms, formats, abbreviations, and jargon.
The variability in the way information is encoded in private databases tends to propagate to public repositories, which makes the information more difficult to interpret and to use. There are different existing standards that can be used to structure contextual data, like minimum information checklists (MIxS [50], MIGS [51], the NIAID/BRC Project and Sample Application Standard [52]) and various interoperable ontologies (OBO Foundry [53]), which make information easier to aggregate and reuse for different types of analyses. However, these attribute packages and metadata standards developed by different organizations are usually scoped to cover as many use cases and pathogens as possible and, as such, can include fields of information not applicable to SARS-CoV-2, or that may be subject to privacy concerns, or exclude fields commonly used in public health surveillance and investigations. Because different types of contextual data are subject to different ethical, practical, and privacy concerns, not all components of existing standards are immediately or widely collectable and shareable. As a result, the range of generic metadata standards being applied to SARS-CoV-2 data presents challenges for data harmonization [54] and analysis critical for fighting the disease and ending the pandemic.
In light of these challenges, PHA4GE has identified a need for a fit-for-purpose, open-source SARS-CoV-2 contextual data specification that can be used to consistently structure information as part of good data management practices and for data sharing with trusted partners and/or public repositories. The specification was developed by consensus among domain experts, and incorporates existing community standards with an emphasis on SARS-CoV-2 public health needs and ensuring privacy while maximizing information content and interoperability across datasets and databases to better enable analyses to fight COVID-19. The specification package also contains a number of accompanying materials such as standard operating procedures, tools, a reference guide, and repository submission protocols (protocols.io) to help put the standard into practice.

SARS-CoV-2 Contextual Data Specification: The Framework
The purpose of the PHA4GE SARS-CoV-2 specification is to provide a mechanism for consistent structure, collection, and formatting of fields and values containing SARS-CoV-2 contextual data pertaining to clinical, animal, and environmental samples. We emphasize that the purpose of this specification is not to force data sharing but rather to provide a framework to structure data consistently across disparate laboratory and epidemiological databases so that they can be harmonized for different uses (Fig. 1). Data sharing is just one use case and can involve sharing between divisions within a single agency, sharing between partners based on memorandums of understanding, or submission to public repositories.
The PHA4GE SARS-CoV-2 contextual data specification was created through broad consultation with representatives from public health laboratories, research institutes, and universities in 11 countries (Argentina, Australia, Brazil, Canada, Germany, Nigeria, Portugal, South Africa, Switzerland, the United Kingdom, the United States of America) who are involved with SARS-CoV-2 genome sequencing and analysis efforts at various scales. Based on this consultation and consensus, the specification contains different fields covering a wide array of data types described in Box 1 (Fig. 1). The specification attempts to harmonize different data standards (e.g., INSDC, GISAID, MIxS, MIGS, Sample Application Standard) by reusing fields or mapping to fields, as much as possible. Because PHA4GE embraces FAIR data stewardship principles (Findability, Accessibility, Interoperability, and Reuse of digital assets), we strived to implement FAIR principles in the design and implementation of the specification for data management and data sharing. At their core, these principles emphasize machine-actionability and consistency of data and are critical for dealing with the volume and complexity of genomic sequence and contextual data. Principles of FAIR data stewardship that have been implemented include improving machine-actionability of data by using a formal, accessible, shared, and broadly applicable language for knowledge representation, reusing existing standards and ontology-based vocabulary to increase interoperability, providing a data use license, capturing data provenance, and making all resources open, free, and widely accessible.
The versioned specification is available as a contextual data collection template (.xlsx) and in machine-amenable JSON format from GitHub (version 3.0.0) [55]. The collection template also offers standardized terms for a number of fields in the form of pick lists. The fields are colour-coded to indicate required (yellow), strongly recommended (purple), or optional status (white). Fields useful for surveillance were prioritized as "required". Formats for data elements like dates are also prescribed according to international standards (e.g., dates should be formatted according to ISO 8601).
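As an illustration of what a prescribed date format means in practice, the following is a minimal sketch of an ISO 8601 date check. It assumes that only full calendar dates in the extended form YYYY-MM-DD are accepted; the function name is hypothetical and is not part of the specification package, and reduced-precision dates (e.g., year only) would need additional handling.

```python
from datetime import date


def is_iso8601_date(value: str) -> bool:
    """Return True if value is a valid calendar date written as YYYY-MM-DD.

    Assumption: only full dates in the extended ISO 8601 form are accepted.
    """
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        # Not parseable as an ISO 8601 date, or not a real calendar date
        return False
    # Newer Python versions also parse other ISO 8601 forms (e.g., 20210623),
    # so require that the value round-trips to the canonical YYYY-MM-DD form.
    return parsed.isoformat() == value


if __name__ == "__main__":
    for candidate in ["2021-06-23", "23/06/2021", "2021-02-30"]:
        print(candidate, is_iso8601_date(candidate))
```

A check of this kind can be run on a collection date column before submission, so that ambiguous day/month orderings are caught early.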
The template is also supported by several materials such as term and field-level Reference Guides (available as tabs in the collection template Excel workbook), which provide definitions, data entry guidance, and examples of usage [55]. The field-level Reference Guide also provides mapping of PHA4GE fields to existing contextual data standards, highlighting public health and SARS-CoV-2-specific fields that were missing, as well as fields in those other standards that were considered out of scope.
The Open Biological and Biomedical Ontology (OBO) Foundry is a community of researchers who use a prescribed set of principles and practices to develop a wide range of interoperable ontologies focused on the life sciences [56]. Fields and terms in the specification have been mapped to existing OBO Foundry ontology terms, and where required, new ontology terms have been developed and are being made available in different application and domain-specific ontologies within The Foundry (see Table 1 for a list of source ontologies). As of version 3.0.0 and beyond, terms in pick lists provided in the collection template are presented with corresponding ontology identifiers in the format "Label [ontology ID]", e.g., Blood [UBERON:0000178]. Axioms and additional cross references to ontologies and existing standards are actively being developed in collaboration with community developers. We anticipate that our contributions to these freely available, open-source resources will be of use to the COVID-19 research community.
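Because pick list values follow the "Label [ontology ID]" pattern, downstream tools can split them into a human-readable label and a machine-readable identifier. The sketch below shows one way this could be done; the regular expression and function name are assumptions made for illustration and are not taken from the specification package.

```python
import re
from typing import Optional, Tuple

# Assumed shape: free-text label, then an identifier of the form PREFIX:LOCAL_ID
# in square brackets at the end of the string.
TERM_PATTERN = re.compile(
    r"^(?P<label>.+?)\s*\[(?P<term_id>[A-Za-z][A-Za-z0-9_]*:\S+)\]$"
)


def parse_term(value: str) -> Optional[Tuple[str, str]]:
    """Split 'Label [ontology ID]' into (label, ontology ID), or return None."""
    match = TERM_PATTERN.match(value.strip())
    if match is None:
        return None
    return match.group("label"), match.group("term_id")


if __name__ == "__main__":
    print(parse_term("Blood [UBERON:0000178]"))
    print(parse_term("Blood"))
```

Returning None for values without an identifier lets a curator flag free-text entries that still need to be mapped to an ontology term.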
Protocols have also been created and are openly available on protocols.io [57], including a curation Standard Operating Procedure (SOP) containing instructions for using the collection template, as well as guidance for a number of privacy and practical concerns. A series of versioned SARS-CoV-2 sequence and contextual data submission protocols and accompanying instructional videos for how to prepare submissions and navigate through the various submission portals for GISAID, NCBI, and EMBL-EBI are also provided.
A mapping file indicating which PHA4GE fields correspond to contextual data elements recommended by the World Health Organization has been provided to help data providers comply with international guidance [58]. This mapping file also includes tabs indicating which PHA4GE fields correspond to those found in different repository submission forms to facilitate data transformations for submissions. Such transformations can be automated using a contextual data harmonization application called the DataHarmonizer [59]. PHA4GE has worked with the developers of the DataHarmonizer to offer the PHA4GE standard as a template in the tool (I. Gill et al., in preparation). Users can standardize and validate entered data and export it as GISAID and NCBI-ready submission forms (BioSample, SRA, GenBank, and GenBank source modifier forms). It should be noted that other excellent contextual data transformation tools have been developed by the community, such as METAGENOTE, multiSub, and a GISAID-to-ENA conversion script [60][61][62].
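The field-to-field transformation described above can be pictured as a simple renaming step driven by a mapping table. The following sketch is not the DataHarmonizer implementation; the field names in the mapping are illustrative assumptions, and the authoritative correspondences are those given in the PHA4GE mapping file.

```python
from typing import Dict, List, Tuple

# Illustrative mapping only (assumed names); consult the PHA4GE mapping file
# for the actual correspondence between fields and repository forms.
EXAMPLE_FIELD_MAP: Dict[str, str] = {
    "sample collection date": "collection_date",
    "geo_loc name (country)": "geo_loc_name",
    "organism": "organism",
}


def transform_record(
    record: Dict[str, str], field_map: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Rename fields for a target submission form.

    Returns the transformed record and the list of source fields that have
    no counterpart in the target form, so that nothing is dropped silently.
    """
    transformed: Dict[str, str] = {}
    unmapped: List[str] = []
    for field, value in record.items():
        target = field_map.get(field)
        if target is None:
            unmapped.append(field)
        else:
            transformed[target] = value
    return transformed, unmapped


if __name__ == "__main__":
    record = {
        "sample collection date": "2021-06-23",
        "geo_loc name (country)": "Canada",
        "host age": "45",
    }
    print(transform_record(record, EXAMPLE_FIELD_MAP))
```

Reporting unmapped fields separately makes it explicit which contextual data stay local when a record is prepared for a given repository.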
The different specification package materials are outlined in Table 2.

Getting Started: How To Use the Standard
In designing the specification we first considered the goals of data collection and harmonization. Consulted stakeholders believed that the primary priority of standardizing data should be improved support for SARS-CoV-2 genomic surveillance activities and the submission of sequence data and minimal metadata to public repositories. The two most important attributes for tracking transmission from pathogen genomic data are temporal information describing when a sample was collected and spatial information describing where a virus was sampled. Comparisons of minimal contextual data requirements across different national sequencing efforts, as well as submission requirements for INSDC and GISAID databases, yielded a minimal set of 14 fields that have been annotated as "required" in the specification (colour-coded yellow in the collection template). The required fields, corresponding definitions, and guidance notes are described in Table 3. A number of other fields have been annotated as "strongly recommended" (colour-coded purple in the collection template) for capturing sample collection and processing methods, critical epidemiological information about the host, and acknowledging scientific contributions. Fields colour-coded white are considered optional.
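A completeness check on required fields could look like the minimal sketch below. The field names listed are a hypothetical subset chosen for illustration; the actual set of 14 required fields is the one defined in Table 3 and in the collection template.

```python
from typing import Dict, List, Optional

# Hypothetical subset of required fields, for illustration only;
# the full list of 14 required fields is defined by the specification.
EXAMPLE_REQUIRED_FIELDS: List[str] = [
    "specimen collector sample ID",
    "sample collection date",
    "geo_loc name (country)",
    "organism",
]


def missing_required(
    record: Dict[str, Optional[str]], required: List[str]
) -> List[str]:
    """Return the required fields that are absent or empty in a record."""
    missing: List[str] = []
    for field in required:
        value = record.get(field)
        # Treat absent keys, None, and whitespace-only strings as missing
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


if __name__ == "__main__":
    record = {
        "specimen collector sample ID": "sample-001",
        "sample collection date": "",
        "organism": "Severe acute respiratory syndrome coronavirus 2",
    }
    print(missing_required(record, EXAMPLE_REQUIRED_FIELDS))
```

The same function can be reused with a "strongly recommended" list to produce warnings rather than errors.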
Because many contextual data fields are stored in different locations and databases (e.g., LIMS, epidemiology case report forms and databases), a benefit of implementing the PHA4GE collection template is that it enables the capture of these different pieces of information in one place. The collection template also offers pick lists for a variety of fields, e.g., a curated INSDC country list for "geo_loc name (country)," the standardized name of the virus under the "organism" field (i.e., severe acute respiratory syndrome coronavirus 2), and a multitude of standardized terms for sample types (anatomical materials and sites, environmental materials and sites, collection devices and methods). The "purpose of sequencing" field provides standardized tags that can be used to highlight sampling strategy criteria (e.g., baseline surveillance [random sampling] or targeted sequencing [non-random sampling]), which are very important for understanding bias when interpreting patterns in sequence data. The pick lists provided are neither exhaustive nor comprehensive but have been curated from current literature representing active sampling and surveillance activities.
If a pick list is missing standardized terms of interest, the reference guide also provides links to different ontology look-up services, enabling users to identify additional standardized terms. The reference guide provides definitions for the fields, additional guidance regarding the structure of the values in the field, and any suggestions for addressing issues pertaining to privacy and identifiability. The curation SOP provides users with step-by-step instructions for populating the template, looking up standardized terms, and how best to structure sample descriptions. The SOP also highlights a number of ethical, practical, and privacy considerations for data sharing.

Implementation of the PHA4GE specification around the world
The extent to which, and the manner in which, the specification is implemented is ultimately at the discretion of the user. To date, versions of the specification are being implemented in the CanCOGeN (Canada) and SPHERES (USA) SARS-CoV-2 sequencing initiatives, the AusTrakka (Australia and New Zealand) data sharing platform [1][2][3], and by the Global Emerging Pathogens Treatment Consortium (Africa) [63], the African Centre of Excellence for Genomics of Infectious Diseases (ACEGID) in Nigeria [64], the Baobab LIMS [65] at the South African National Bioinformatics Institute (SANBI) [66], and the Latin American Genomics Network [67].
Canada is implementing a version of the PHA4GE specification to harmonize contextual data across all data providers for national SARS-CoV-2 surveillance [5]. Harmonized contextual information is provided by different jurisdictions and stored in the national genomics surveillance database at the Public Health Agency of Canada's National Microbiology Laboratory. A hypothetical worked example is provided to demonstrate how free text information can be structured according to the specification and how subsets of the contextual data can be shared according to jurisdictional policies (Fig. 2).
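The idea of sharing different subsets of one harmonized record with different audiences can be sketched as a per-field policy lookup. The three audience levels, the policy table, and the field names below are assumptions made for illustration; actual sharing rules are set by each jurisdiction and are not encoded in the specification itself.

```python
from typing import Dict

# Assumed audience levels, ordered from most open to most restricted
LEVELS: Dict[str, int] = {"public": 0, "partner": 1, "private": 2}


def subset_for(
    record: Dict[str, str], policy: Dict[str, str], audience: str
) -> Dict[str, str]:
    """Return the fields of a record that may be shared with an audience.

    policy maps each field to the most open audience allowed to see it.
    Fields missing from the policy default to 'private' as a cautious choice.
    """
    limit = LEVELS[audience]
    return {
        field: value
        for field, value in record.items()
        if LEVELS[policy.get(field, "private")] <= limit
    }


if __name__ == "__main__":
    # Hypothetical record and jurisdictional policy
    record = {
        "sample collection date": "2021-06-23",
        "geo_loc name (country)": "Canada",
        "host age": "45",
        "case ID": "local-0001",
    }
    policy = {
        "sample collection date": "public",
        "geo_loc name (country)": "public",
        "host age": "partner",
    }
    print(subset_for(record, policy, "public"))
    print(subset_for(record, policy, "partner"))
```

Defaulting unlisted fields to the most restricted level means a newly added field is never released by accident.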
While the primary use case of the specification is for public health sequencing, the sample collection fields have been developed to enable capture of information for a wide range of sample types, including environmental samples (e.g., swabs of hospital equipment and patient rooms, wastewater samples) and nonhuman hosts (e.g., wildlife, agricultural animal samples).

Submitting Data to Public Sequence Repositories
Many existing SARS-CoV-2 sequences have only been deposited in GISAID, with a proportion of submitters also depositing matching raw read data in the INSDC (i.e., NCBI, European Molecular Biology Laboratory-European Bioinformatics Institute [EMBL-EBI], and DNA Data Bank of Japan [DDBJ]). While consensus genomes are widely deposited and used for public surveillance purposes, raw read data are critical for comparing methods and assessing reproducibility, as well as identifying minor variants. Linkage of contextual data to consensus sequences as well as raw data in public repositories is vital.
Within the INSDC, the contextual data are stored as accessioned BioSamples [68] with a consistent set of attribute names and standardized values. BioSamples add value, promote reuse, and enable interoperability of data submitted from laboratories that may only be connected by following the same metadata standard. The INSDC databases have until recently provided a generic pathogen metadata template for the BioSample that is heavily utilized for bacterial genomic surveillance [69]. GISAID uses a different format and data structure for associating metadata primarily for influenza surveillance and now extended to include SARS-CoV-2. The ENA provides a virus metadata checklist (ENA virus pathogen reporting standard checklist) developed as part of the COMPARE project [70], which is very similar to the GISAID submission requirements.
Building on these existing standards, a metadata specification for SARS-CoV-2 genomic surveillance was developed that is broad enough for internal laboratory use while providing mechanisms for mapping/transforming standardized contextual data for public release to INSDC and GISAID. Recently, PHA4GE worked with NCBI to develop a dedicated SARS-CoV-2 BioSample submission package in the NCBI Submission Portal, which incorporates many fields from the PHA4GE standard [71]. The Genomic Standards Consortium will also align its forthcoming "MIxS for SARS-CoV-2" package with this specification. EMBL-EBI will also offer the PHA4GE standard to submitters as one of its validated checklists. Taken together, the PHA4GE specification has already had widespread impact on contextual information data structures around the world.
The detailed mapping of PHA4GE fields to public repository submission requirements, as well as guidance and advice, are available as supporting documents (see Table 1). We have also provided detailed protocols for data submission to the three participating repositories, GenBank/SRA (NCBI), ENA (EMBL-EBI), and GISAID. An overview of how the PHA4GE specification is integrated into public repository submissions is presented in Fig. 3.
The specification has been used to submit standardized contextual data to different repositories by laboratories and sequencing initiatives globally. A selection of accession numbers for submissions to different repositories is provided in Table 4.

Conclusion
The collective response to the SARS-CoV-2 pandemic has resulted in an unprecedented deployment of genomic surveillance worldwide, bringing together public health agencies, academic research institutions, and industry partners. This unified action provides opportunities to more effectively understand and respond to the pandemic. Yet it also provides an enormous challenge because realizing the full potential of this opportunity will require standardization and harmonization of data collection across these partners. With our SARS-CoV-2 metadata specification we have endeavoured to create a mechanism for promoting consistent, standardized contextual data collection that can be applied broadly. We envision that given the increased uptake, this specification will improve the consistency of collected data, making information reusable by agencies as they continue working towards an increased understanding of SARS-CoV-2 epidemiological and biological characteristics, and harmonizing them such that community-based data-sharing efforts are not excessively burdened. We anticipate that the experience and lessons learned creating the specification package for SARS-CoV-2 will better enable the rapid development and deployment of pathogen-specific standards for public health pathogen genomic surveillance in the future.

Methods
The PHA4GE SARS-CoV-2 data specification was developed by first comparing existing metadata standards (e.g., MIxS/MIGS, the NIAID/BRC Sample Application Standard) and various sequence repository submission requirements (e.g., GISAID, INSDC), as well as national and international case report forms.
A gap analysis was performed to identify SARS-CoV-2 public health surveillance data elements that were missing in available standards. Fields in existing standards that were deemed to be out of scope were excluded from the specification. Terms for pick lists were sourced from public health documents, the literature, and, when available, various interoperable ontologies (OBO Foundry). The fields and terms from the gap analysis were structured in the collection template (.xlsx). Field definitions, guidance for use, examples, and mappings to various standards were developed as part of the Reference Guides provided in separate tabs in the template workbook. Vocabulary lists were also provided in a separate tab in the template workbook to enable validation and to enable users to add terms to pick lists as needed, according to instructions provided in the curation SOP. The specification was also encoded as a JSON file. The specification was reviewed by public health, bioinformatics, and data standards experts from different public health agencies, research institutes, and sequencing consortia and adapted according to feedback. Upon request by community members, versioned protocols for public repository submission were created and deposited in protocols.io.

Downloaded from https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac003/6529104 by guest on 22 February 2022

SAMN20400588
The first version of the specification was made publicly available in August 2020 with a CC-BY 4.0 International attribution license. Iterative improvements were made to a development branch of the specification over the next 10 months as the pandemic evolved, and in response to user feedback and requests. The second major release (2.0) was made publicly available in May 2021. A third major release (3.0) including ontology mappings and the term-level reference guide was made publicly available in December 2021. The PHA4GE template was incorporated into the contextual data harmonization, validation, and transformation tool called The DataHarmonizer through a collaborative effort with the Centre for Infectious Disease Genomics and One Health (Simon Fraser University). Details regarding DataHarmonizer development can be found elsewhere (e.g., [72]; I. Gill et al., in preparation).

Figure 1: Contextual data flow. Contextual data can be captured and structured using the PHA4GE specification so that they can be more easily harmonized across different data sources and providers. Different subsets of the harmonized data can be (i) shared with public repositories, e.g., GISAID and INSDC; (ii) shared with trusted partners, e.g., national sequencing consortia, public health partners; and (iii) kept private and retained locally with the potential for sharing in the future for particular surveillance or research activities. While fields have been colour-coded in the template to indicate whether they are considered "required," "strongly recommended," or "optional," how the specification is implemented and whether any of the data are shared is ultimately at the discretion of the user. Box 1 describes the information types covered in the full specification.

Figure 2: The PHA4GE specification is being implemented in CanCOGeN to harmonize contextual data across jurisdictions. (A) CanCOGeN is Canada's SARS-CoV-2 national genomic surveillance initiative. Canada has a decentralized health system, with one federal and 13 provincial/territorial public health jurisdictions. Provinces/Territories have authority over how data are collected, stored, and shared. Every Canadian public health jurisdiction uses different collection instruments (e.g., case report forms), different data management systems, and different pipelines and software to perform bioinformatic analyses. Provinces/Territories share sequencing data and accompanying contextual data with the National Microbiology Lab's national SARS-CoV-2 genomics database (starred) according to a version of the PHA4GE specification for national surveillance activities. (B) Excerpts from two different province-specific case collection forms. Sample type information is collected in data collection instruments using different fields, different terms, at different levels of granularity, using abbreviations and formats. BAL: bronchoalveolar lavage; NPS: nasopharyngeal swab; UTM: universal transport medium. (C) An anonymized example of how the standard consistently structures contextual information and how it is being used for data sharing. The contextual data specification provides a wide variety of fields and pick lists of terms. In the example, the full set of standardized information shown would be shared by the province with the national database. Standardized information in boldface would be shared with public repositories; however, select data elements (underscored) would be withheld according to jurisdictional data sharing policies. The specification enables users to harmonize and integrate data provenance, sampling strategy criteria, epidemiological information, and methods.

Figure 3: Overview of how the PHA4GE SARS-CoV-2 contextual data specification can be integrated into public repository submission. The PHA4GE collection template provides a one-stop shop for different data types that are important for global surveillance. The protocols provided as part of the specification package describe how PHA4GE fields can be mapped to different repository submission forms. Consensus sequences (FASTA), accompanied by a subset of PHA4GE fields, can be submitted to the GISAID EpiCoV database (A). Consensus sequences (FASTA) (B) as well as raw/processed data (FASTQ, BAM) (C, D) can be submitted to INSDC databases (e.g., GenBank, SRA) with different subsets of PHA4GE fields as part of a BioSample record. BioSamples are propagated throughout INSDC databases.

Table 2: Resources that form the PHA4GE SARS-CoV-2 contextual data specification package [55]
PHA4GE fields are mapped to existing metadata standards such as the Sample Application Standard, MIxS 5.0, and the MIGS Virus Host-associated attribute package. Mappings are available in the Reference guide tab. Mappings highlight which fields of these standards are considered useful for SARS-CoV-2 public health surveillance and investigations, and which fields are considered out of scope.
PHA4GE recommendations for FAIR SARS-CoV-2 data submissions are as follows:

Table 4: A selection of accession numbers of harmonized contextual data records submitted to different public repositories