ERAIZDA: a model for holistic annotation of animal infectious and zoonotic diseases

There is an urgent need for a unified resource that integrates trans-disciplinary annotations of emerging and reemerging animal infectious and zoonotic diseases. Such data integration will provide an opportunity for epidemiologists, researchers and health policy makers to make data-driven decisions designed to improve animal health. Integrating emerging and reemerging animal infectious and zoonotic disease data from a large variety of sources into a unified open-access resource provides a stronger basis for understanding infectious and zoonotic diseases. We have developed a model for interlinking annotations of these diseases. These diseases are of particular interest because of the threats they pose to animal health, human health and global health security. We demonstrated the application of this model using brucellosis, an infectious and zoonotic disease. Preliminary annotations were deposited into the VetBioBase database (http://vetbiobase.igbb.msstate.edu). This database is associated with user-friendly tools to facilitate searching, retrieving and downloading of disease-related information. Database URL: http://vetbiobase.igbb.msstate.edu


© The Author(s) 2015. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction
The 21st century continues to be the era of big data, involving not only the omics technologies but also epidemiology (1)(2)(3) and public health (4)(5)(6)(7)(8)(9). Traditionally, the large volumes of structured or unstructured data generated from different fields are independently deposited into discipline-specific resources for use by the intended communities. Knowledge gaps result across related fields due to lack of connectivity in managing distinct but interrelated data. Take, for example, the impact of insufficient knowledge of the emerging and reemerging animal infectious and zoonotic diseases (ERAIZD). It is clearly understood that ERAIZD continue to pose major threats to animal health, human health and global health security (10)(11)(12)(13)(14). To narrow the knowledge gaps and accelerate wide-ranging discoveries from interrelated data, it is essential that data integration is done first.
As ERAIZD-related big data continue to accumulate, there is an urgent need for an open-source unified resource associated with user-friendly tools to facilitate data usage. Trans-disciplinary integrated disease data allow investigators to quantify key characteristics such as incubation periods, heterogeneity in transmission rates, duration of infections and the existence of high-risk groups (2). At the epidemiological level, there have been several efforts to create global enhanced systems that address ERAIZD, including sharing of health risk information at the animal-human-ecosystem interface. The Global early warning and response system (GLEWS+) integrates information from the Food and Agriculture Organization (FAO) of the United Nations, the World Organization for Animal Health (OIE) and the World Health Organization (WHO) Global Alert and Response (15,16). Another early warning system is the Program for Monitoring Emerging Diseases (ProMED)-mail, an internet-based reporting program of the International Society for Infectious Diseases (ISID), which is dedicated to rapid global dissemination of up-to-date, expert-curated information on outbreaks of infectious diseases (17)(18)(19). The ProMED community obtains information from multiple sources including media reports, official reports, online summaries and local observers. These reports are screened, reviewed and validated by in-country infectious disease experts before posting to the website. Additionally, agencies such as the Centers for Disease Control and Prevention (CDC) (20), the leading federal public health agency in the United States, and the associated global disease detection program (21) provide invaluable information for public use. Moreover, a joint initiative of FAO and OIE established the Global Framework for progressive control of Transboundary Animal Diseases (22) to empower regional alliances in the fight against international spread of animal diseases.
At the molecular level, various infectious agents have different genetic makeups that may cause similar or different signs and symptoms. Nucleic acid-based technologies for the molecular diagnosis of infectious diseases, such as the polymerase chain reaction, ligase chain reaction, transcription-mediated amplification and nucleic acid sequence-based amplification, have become tools for understanding key molecular factors related to ERAIZD causal agents. These techniques provide highly accurate diagnosis of infections caused by bacteria, viruses, fungi, parasites and others (23)(24)(25)(26)(27)(28). For example, the CDC has been using advanced molecular detection technology, such as whole-genome sequencing, as a means of surveying the genetic differences between isolates of various pathogenic organisms. This has led to the establishment of a public database for sharing information about potentially deadly diseases such as anthrax and brucellosis (2). This type of genomic surveillance is a more accurate, more reliable and more cost-effective means of diagnosing known, emerging and reemerging infections, as well as characterizing foreign or unknown subtypes. However, these molecular techniques only demonstrate the presence of a pathogen (or its subtypes) and not the presence of disease.
Emerging and reemerging diseases are caused by a diverse range of biological agents that exit their reservoirs, enter susceptible hosts through different routes and cause tissue damage. Provision of a resource that links findings from epidemiological and basic research investigations could help answer fundamental questions regarding the factors responsible for disease development. We have developed a model for annotating multidimensional, diverse ERAIZD data from a large variety of sources and integrating it into an open-access unified resource we call ERAIZDA (Emerging and Reemerging Animal Infectious and Zoonotic Disease Annotations). The model is referred to as the 'ERAIZDA model' throughout the article. We envision that availability and utility of an ERAIZDA resource will (i) promote global data sharing beyond usual boundaries, which could lead to improved interactions in multi- and interdisciplinary research teams; (ii) furnish the animal/human health and veterinary research communities with integrated disease information for developing effective joint policies and guidelines for controlling animal infectious and zoonotic diseases; (iii) lead to better planning of coordinated research strategies, including setting up research priorities, developing testable data-driven hypotheses and identifying effective and efficient integrative interventions and (iv) lead to establishing database and data sharing standards used by multi- and interdisciplinary communities for collaborative data sharing, solving many of the current problems in data curation, database application programming interfaces and web services among related data sources. We strategically selected brucellosis to demonstrate implementation of the ERAIZDA model. Globally, brucellosis is one of the most significant zoonotic diseases (29), posing both a public health challenge and an economic impact. It has a worldwide distribution and is absent in only a few countries (30). It is listed by the OIE (31,32) as one of the 2015 notifiable diseases and by the WHO as one of the seven neglected endemic zoonoses (33).

Structure of ERAIZDA model
The ERAIZDA model adopts an annotation approach that takes into consideration all sources of information for an infectious disease that occurs at the animal-human-pathogen-ecosystem interface. Identification of the information sources that collectively provide information on animal infectious and zoonotic diseases is a first step when implementing this model. Five main sources of disease information were identified and designated as PADER (Figure 1), where P = publications (articles in journals, books or conference proceedings), A = agency documents (from WHO, CDC, FAO, OIE, GLEWS+, ISID, etc.), D = databases [such as National Center for Biotechnology Information (NCBI), WAHID, Ontologies and host-pathogen interaction], E = expert validated information (from health departments, veterinary professionals, reviewed ProMED-mails, reviewed health news, educational resources, etc.) and R = reports officially documenting disease information (non-electronic reports). Each individual source is then manually annotated by highly skilled biocurators to generate associations describing the disease features, causal agents and molecular data such as biomarkers. These annotations are then organized and deposited into a unified resource.
Comprehensive annotation of animal infectious and zoonotic diseases using the PADER approach is a complex process that requires the coordinated and dedicated effort of people with diverse expertise and skills to integrate multidisciplinary disease information into a unified resource. The complete ERAIZDA model described herein (Figure 2) illustrates this complexity. Briefly, we assumed that once a disease is observed in a community, the process of collecting the disease information and reporting the incidence starts immediately. This is followed by strategic planning of disease management by all key players, including health personnel, epidemiologists and clinical or basic scientists who may conduct individual, multidisciplinary or interdisciplinary investigations on the disease. Traditionally, findings from these investigations are documented in different PADER sources. Annotation of these diverse sources could yield varied disease information that can be integrated into a unified ERAIZDA resource. Provision of user-friendly tools facilitates searching, retrieving and/or downloading specific information from the ERAIZDA resource. This information can then be thoroughly reviewed and prioritized to facilitate better informed decision making for new interventions or for improving the existing ones.

Annotation parameters
The ERAIZDA model is designed for comprehensive annotation of animal infectious and zoonotic diseases from diverse sources (PADER) to generate multidisciplinary disease-related information. We have defined parameters to be used as a reference to guide the annotation process in each PADER source. These parameters are grouped into three broad categories representing (i) diseases, (ii) pathogens (causal agents) and (iii) molecular information, specifically disease biomarkers. The model adopts different terminologies for describing infection, frequency and typology of an infectious or zoonotic disease (Table 1). A detailed description of all parameters is given in Supplementary File S1. We acknowledge that debate may exist around the description of these parameters, which could lead to better and standardized disease annotation terms.

Implementation of the ERAIZDA model

Defining disease parameters
Brucellosis was chosen as a classical example of a significant endemic zoonotic disease to demonstrate the implementation of the ERAIZDA model. The annotations of brucellosis described herein are not by any means exhaustive but serve as a model for annotating animal infectious and zoonotic diseases from diverse sources.
Preliminary annotation of selected PADER sources enabled us to establish important parameters (Table 2, Supplementary File S1) for guiding comprehensive annotation of animal infectious and zoonotic diseases using this model. In order to provide structured annotations, the parameters were organized in an Excel spreadsheet where rows represented diseases and columns represented parameters.

Google first-pass analysis
The process of annotation started with a First-Pass Analysis: searching Google for important keywords [brucellosis OR (brucellosis) OR (brucellosis zoonotic disease) OR (brucellosis in animal) OR (brucellosis in human)] to get a clear picture of the expected sources of information and an initial knowledge of brucellosis. The search returned 180 000 Google hits. Clustering of the top 100 hits using PADER codes identified the most important sources of brucellosis information (Figure 3; Supplementary File S2).
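The clustering of hits by PADER code can be pictured with a small tally. The following is only a minimal sketch, assuming that each hit has already been assigned a code by a curator; the function name and the example URLs are hypothetical placeholders and are not the data reported in Supplementary File S2.

```python
from collections import Counter

# PADER source codes as described in the model structure.
PADER = {
    "P": "publications",
    "A": "agency documents",
    "D": "databases",
    "E": "expert validated information",
    "R": "official reports",
}


def tally_by_pader(hits):
    """Count curator-coded search hits per PADER code.

    `hits` is an iterable of (url, code) pairs. The code is assumed to be
    assigned manually; nothing here classifies a page automatically.
    """
    counts = Counter({code: 0 for code in PADER})
    for url, code in hits:
        if code not in PADER:
            raise ValueError(f"unknown PADER code {code!r} for {url}")
        counts[code] += 1
    return counts


# Hypothetical hand-coded hits (placeholder URLs, not the study's results).
example_hits = [
    ("https://example.org/article-1", "P"),
    ("https://example.org/agency-page", "A"),
    ("https://example.org/article-2", "P"),
    ("https://example.org/database-entry", "D"),
]

counts = tally_by_pader(example_hits)
for code, n in counts.most_common():
    print(f"{code} ({PADER[code]}): {n}")
```

Codes with no hits are kept at zero so that every PADER source appears in the summary.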
From the First-Pass Analysis, we were able to generate preliminary brucellosis information that enabled us to establish custom terms (values) for describing disease parameters (Supplementary File S1). In the process of annotating brucellosis, we found that some parameters may not apply or be relevant to some diseases, and data may not be available for some parameters. In such cases, the corresponding fields are represented with a triple hyphen (---) until such data become available in subsequent updates.
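The spreadsheet layout (one row per disease, one column per parameter) and the triple-hyphen convention for missing values might be sketched as follows. This is an illustration only: the four column names are hypothetical stand-ins for the parameters in Supplementary File S1, and CSV is used here in place of the Excel spreadsheet.

```python
import csv
import io

MISSING = "---"  # placeholder for parameters with no data yet

# Hypothetical parameter names; the real list is in Supplementary File S1.
PARAMETERS = [
    "disease_name",
    "causal_agent",
    "transmission_route",
    "incubation_period",
]


def write_annotations(rows, out):
    """Write one row per disease, filling absent parameters with '---'."""
    writer = csv.DictWriter(out, fieldnames=PARAMETERS, restval=MISSING)
    writer.writeheader()
    for row in rows:
        # Drop empty values so that restval supplies the placeholder.
        cleaned = {k: v for k, v in row.items() if v not in (None, "")}
        writer.writerow(cleaned)


# Only two parameters are filled in; the rest are deliberately left blank.
rows = [
    {"disease_name": "brucellosis", "causal_agent": "Brucella spp.",
     "incubation_period": ""},
]

buffer = io.StringIO()
write_annotations(rows, buffer)
print(buffer.getvalue())
```

Keeping the placeholder in one constant makes it straightforward to replace those fields once data become available.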

Quantifiable sources of brucellosis information
The primary quantifiable sources of reliable brucellosis information, including epidemiology of the disease and molecular characterization of the causal agents, are a combination of articles published in peer-reviewed journals and controlled databases represented in the NCBI. The NCBI PubMed database is a widely used gateway for browsing information published in scientific journals.

Entrez records
Entrez is the NCBI's primary text search and retrieval system, which integrates diverse databases such as the PubMed and Taxonomy databases with molecular databases covering DNA and protein sequences, genes, genomes, single-nucleotide polymorphisms and gene expression data (45). Traditionally, each species represented in the NCBI Taxonomy Database is identified with a unique name and a taxon-specific unique identifier that distinguishes one species from another. Unique names and identifiers of the Brucella species facilitated identification of species-specific molecular records of the causal agents of brucellosis (Supplementary File S6). These molecular data, especially the nucleotide (transcript) and amino acid (protein) sequences, are very important in identifying unique features of otherwise indistinguishable causal agents. The ERAIZDA model includes a link for accessing the most current molecular data for each annotated causal agent indexed in the Entrez databases.
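One way such a taxon-specific link could be constructed is through the NCBI E-utilities ESearch endpoint. The sketch below only builds the URL and makes no network request. It is not a description of how the VetBioBase links are generated; the helper name is hypothetical, and the taxonomy identifier used in the example (234, assumed here to be the genus Brucella) should be verified against the NCBI Taxonomy Database.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_search_url(database, taxon_id, retmax=20):
    """Build an ESearch URL for records linked to an NCBI taxonomy ID.

    The 'txid<N>[Organism:exp]' term also matches records filed under
    descendant taxa (for example, species within a genus).
    """
    params = {
        "db": database,
        "term": f"txid{taxon_id}[Organism:exp]",
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Assumption: 234 is the taxonomy ID of the genus Brucella; verify before use.
url = entrez_search_url("protein", 234)
print(url)

# Decode the query string again to confirm the search term survived encoding.
query = parse_qs(urlsplit(url).query)
print(query["term"][0])
```

Building the link from the taxonomy identifier rather than the species name avoids ambiguity when names are revised.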

Preliminary biomarkers of brucellosis
The preliminary data on brucellosis biomarkers were annotated from a sample of only 10 PubMed Central articles. The annotations, including the experimental texts that support them, are available as Supplementary File S7. These data can also be retrieved from the VetBioBase database (46) using the dseMARKERS tool.

Integration of brucellosis information into unified resource
To facilitate management and sharing of preliminary annotations of brucellosis, we organized the raw data into MySQL relational tables representing disease features, causal agents and molecular biomarkers (Figure 4). Collectively, these tables formed a foundation for establishing ERAIZDA, a unified resource for integrating animal infectious and zoonotic disease annotations. The tables were then loaded into the VetBioBase database (46) for public use. These preliminary annotations enabled us to develop seven user-friendly tools for searching, retrieving and/or downloading specific information from the ERAIZDA dataset.
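The three-table organization can be illustrated with a small relational sketch. The actual schema is the one shown in Figure 4 and is not reproduced here: the table and column names below are assumptions, SQLite (through Python's standard sqlite3 module) stands in for MySQL so that the example is self-contained, and the biomarker row is a labeled placeholder rather than a real annotation.

```python
import sqlite3

# Assumed, simplified schema: one table each for diseases, causal agents
# and biomarkers, linked by foreign keys.
SCHEMA = """
CREATE TABLE disease (
    disease_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE
);
CREATE TABLE causal_agent (
    agent_id   INTEGER PRIMARY KEY,
    disease_id INTEGER NOT NULL REFERENCES disease(disease_id),
    name       TEXT NOT NULL
);
CREATE TABLE biomarker (
    biomarker_id INTEGER PRIMARY KEY,
    agent_id     INTEGER NOT NULL REFERENCES causal_agent(agent_id),
    name         TEXT NOT NULL,
    evidence     TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding one placeholder annotation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    cur = conn.execute("INSERT INTO disease (name) VALUES (?)", ("brucellosis",))
    disease_id = cur.lastrowid
    cur = conn.execute(
        "INSERT INTO causal_agent (disease_id, name) VALUES (?, ?)",
        (disease_id, "Brucella abortus"),
    )
    agent_id = cur.lastrowid
    # Placeholder values only; real biomarkers are in Supplementary File S7.
    conn.execute(
        "INSERT INTO biomarker (agent_id, name, evidence) VALUES (?, ?, ?)",
        (agent_id, "placeholder-marker", "placeholder evidence"),
    )
    conn.commit()
    return conn


def biomarkers_for_disease(conn, disease_name):
    """Return (agent, biomarker, evidence) rows for one disease."""
    query = """
        SELECT a.name, b.name, b.evidence
        FROM disease d
        JOIN causal_agent a ON a.disease_id = d.disease_id
        JOIN biomarker b ON b.agent_id = a.agent_id
        WHERE d.name = ?
        ORDER BY a.name, b.name
    """
    return conn.execute(query, (disease_name,)).fetchall()


conn = build_demo_db()
result = biomarkers_for_disease(conn, "brucellosis")
print(result)
```

A single join across the three tables is the kind of query that a search or download tool could issue on the user's behalf.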

Discussion
We present the ERAIZDA model, a comprehensive model for annotating animal infectious and zoonotic diseases. This model uses over 50 pre-defined parameters (Supplementary File S1) as a reference to guide the disease annotation process from diverse sources referred to (in this article) as PADER. Each parameter describes significant information related to a particular infectious and/or zoonotic disease. The model encourages data integration into an open-access unified resource for use by the animal/human and veterinary research communities to accelerate integrative translational interventions aimed at achieving better understanding of infectious and zoonotic diseases. Using brucellosis as an example, we demonstrated that multidisciplinary disease-related information from epidemiological investigations, clinical studies, case reports and basic science functional genomics investigations can be linked together and made available through a single resource. Preliminary annotations of brucellosis were used as a foundation to create a unified resource known as ERAIZDA. Our future focus is to use the ERAIZDA model to generate annotations of all animal infectious and zoonotic diseases of national and international priority, starting with the notifiable diseases listed by the World Organization for Animal Health (OIE).
The central proposition of integrating animal disease data is to enable all key players in the disease management process to quickly access disease information and process it to uncover hidden value, narrow the knowledge gap and make data-driven decisions. Using the pre-defined parameters (Supplementary File S1) as a reference is the most feasible way to ensure that every piece of available information is represented. It is most likely that a single source of disease information would not be able to provide all the parameters (diseases, causal agents, molecular biomarkers, etc.). This is why PADER becomes such an important approach. For example, when annotating molecular biomarkers using the ERAIZDA model, we recommend including supporting evidence. The best evidence is to link the biomarker with a reference, such as PubMed articles, and with text that will enable users to have confidence in the annotated biomarker. For instance, sequences with amino acid replacements may signify useful molecular biomarkers for detecting mutations that could also be a basis for detecting antimicrobial resistance and virulence determinants. Adding a reference to the sequence information is particularly important, especially if the annotation involves cross-referencing between related diseases. Here is an example of cross-species annotation from two articles, PubMed ID 21151656 (PubMed Central ID PMC2997342) and PubMed ID 17021120 (PubMed Central ID PMC1594805): 'Mycobacterium bovis, a bacterium that causes bovine tuberculosis, is naturally resistant to pyrazinamide (47) due to its inability to produce the pyrazinamidase enzyme needed to convert pyrazinamide into the active form of the antimicrobial agent (48). However, point mutations in the rpoB (DNA-directed RNA polymerase subunit beta) and katG (catalase-peroxidase) genes are considered potential biomarkers for resistance to rifampicin and isoniazid in human Mycobacterium tuberculosis (49)'.
Adding a short supporting text like this enables researchers to quickly consider what to do next. In this example, researchers may use the resistance features to distinguish isolates of M. bovis from M. tuberculosis.
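The recommendation to attach a reference and a short supporting text to every biomarker could be enforced with a simple record type. The sketch below is illustrative only: the class and field names are hypothetical, and both PubMed IDs from the example above are attached together because the text does not state which article supports which statement.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BiomarkerAnnotation:
    """A biomarker record that cannot be created without evidence."""

    gene: str
    organism: str
    phenotype: str
    supporting_text: str
    pubmed_ids: tuple = field(default_factory=tuple)

    def __post_init__(self):
        # Reject annotations that lack a reference or a supporting text.
        if not self.pubmed_ids:
            raise ValueError("at least one PubMed ID is required")
        if not self.supporting_text.strip():
            raise ValueError("supporting text must not be empty")
        for pmid in self.pubmed_ids:
            if not str(pmid).isdigit():
                raise ValueError(f"PubMed ID must be numeric: {pmid!r}")


# Values taken from the cross-species example in the text.
example = BiomarkerAnnotation(
    gene="rpoB",
    organism="Mycobacterium tuberculosis",
    phenotype="rifampicin resistance",
    supporting_text=(
        "Point mutations in rpoB are considered potential biomarkers "
        "for resistance to rifampicin."
    ),
    pubmed_ids=("21151656", "17021120"),
)
print(example.gene, example.pubmed_ids)
```

Validating at creation time means that an annotation without evidence never reaches the shared resource.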
Although the ERAIZDA model intends to link disease names with Disease Ontology terms, we are aware that there are other ontologies that are relevant to the ERAIZDA model. Since data generated using the ERAIZDA model are intended to benefit not only the research community but also a wide range of animal and human health-based communities, we have been very careful not to link ontological terminologies that are not familiar to the audience. Indeed, the model is expected to provide very specific quantitative data guided by pre-defined parameters (as described in Supplementary File S1) that can be used by ontology developers to improve existing ontologies, such as the Pathogen Transmission Ontology and the Symptom Ontology, or to design new ontologies. The model also integrates some aspects of host-pathogen interactome features related to the pathogen and hosts at the molecular level. For example, we understand that virulence of a pathogen is one of the possible effects of host-microbe interaction. In that case, microbial virulence could depend on host factors to be effective. This is why the model links possible molecular biomarkers, including virulence factors, with pathogen characteristics to ascertain the effects of interaction on, for example, microbial virulence or pathogenicity in relation to effects induced by host genes.
It should be emphasized that the integrated data are not useful by themselves unless they are freely accessible to the intended communities for further interventions. User-friendly tools are provided to offer such access to the interested communities. Generally, the ERAIZDA model encourages multidisciplinary collaboration of all key players involved in the disease management process to share research and non-research data without their usual boundaries. It is anticipated that the ERAIZDA data will continue to grow progressively and sustainably as more animal infectious disease annotations are added, and that the research community will utilize and benefit from these data. Although curation is likely to be one of the most challenging aspects of maintaining a sustainable ERAIZDA database, having a uniform annotation procedure can lessen the challenge. To maintain accuracy and uniformity of annotations, curators are encouraged to use special forms containing the pre-defined parameters (Supplementary File S1) to guide the annotation process in each PADER source. Having a standard annotation procedure will also facilitate data integration and updates.

Conclusion
Annotation of animal infectious and zoonotic diseases from diverse sources is a central initiative to reveal important information for new interventions. Most importantly, integrating the disease annotations into an open-access unified resource and providing user-friendly tools for searching, retrieving and downloading specific information could save users considerable time and effort. We expect to increase the depth and breadth of the ERAIZDA dataset by continuing to annotate additional animal infectious diseases of national and international priority. We believe that, since zoonoses can infect both animals and humans, the medical, public health, basic research and veterinary communities will all benefit from this unified resource. We encourage interested communities to use the ERAIZDA data to develop additional models needed to understand the mechanism and pathology of animal infectious and zoonotic diseases and to translate functional genomics of infectious diseases into biological value.

Supplementary data
Supplementary data are available at Database Online.