Perspective: Towards Automated Tracking of Content and Evidence Appraisal of Nutrition Research

Robust recommendations for healthy diets and nutrition require careful synthesis of available evidence. Given the increasing volume of research articles generated, the retrieval and synthesis of evidence are increasingly becoming laborious and time-consuming. Information technology could help to reduce workload for humans. To guide supervised learning however, human identification of key study characteristics is necessary. Reporting guidelines recommend that authors include essential content in articles and could generate manually labeled training data for automated evidence retrieval and synthesis. Here, we present a semiautomated approach to annotate, link, and track the content of nutrition research manuscripts. We used the STROBE extension for nutritional epidemiology (STROBE-nut) reporting guidelines to manually annotate a sample of 15 articles and converted the semantic information into linked data in a Neo4j graph database through an automated process. Six summary statistics were computed to estimate the reporting completeness of the articles. The content structure, presence of essential study characteristics as well as the reporting completeness of the articles are visualized automatically from the graph database. The archived linked data are interoperable through their annotations and relations. A graph database with linked data on essential study characteristics can enable Natural Language Processing in nutrition. Adv Nutr 2020;11:1079–1088.


Introduction
Unhealthy diets and poor nutrition are the leading risk factors for poor health worldwide (1). Recommendations to improve diets and nutrition require the rigorous and timely assessment of evidence (2). A systematic review of published research articles is an essential process to summarize the evidence, but involves time-consuming processes such as literature retrieval, review, and data extraction (3). Machine learning and ontologies are increasingly used to classify, store, and retrieve research output (4,5). Various scholars are piloting (semi-) automated methods of evidence synthesis (6). Natural Language Processing (NLP) enables computers to process human language and could help to extract relevant content from research articles (6–8). NLP typically operates in a supervised manner, using machine learning models that are trained on manually labeled data. The required labeled data for NLP could be provided by authors who annotate articles according to reporting guidelines.
To include essential information in research output, authoritative reporting guidelines (9) are widely used in biomedical sciences. An extension of the STROBE-Nutritional Epidemiology (STROBE-nut) statement (10) was developed to enhance the reporting completeness of nutrition research (11). The use of reporting guidelines is recommended by many journals, including those in nutrition (12). Authors are requested to provide information on the presence of essential content in the text during manuscript submission. This information, typically submitted as supplementary material, becomes redundant after peer reviewing.
There is considerable interest in developing a virtual research infrastructure to advance food and nutrition research (5), and such an infrastructure could enable large-scale data analysis. So far, most efforts have been directed at the annotation, storage, and reuse of individual-level numeric data (e.g., anthropometric and food intake data of individuals). However, research article text represents a wealth of accumulated knowledge and evidence. Additional efforts are needed to annotate and store the text of research articles in a virtual research infrastructure. To enhance (re)use of research output, the FAIR principles are proposed to ensure data are Findable, Accessible, Interoperable, and Reusable. The FAIR principles are also appropriate to manage the text of research articles (13).
Here, we developed and tested a semiautomated approach to retrieve, harmonize, and analyze the reporting completeness of the articles reported according to STROBE-nut. By annotating the articles according to the STROBE-nut reporting guidelines, the essential characteristics of articles are converted into identifiers that can be processed by computers. When applied at scale, the approach could facilitate retrieval, archiving, accession, and use of nutrition research knowledge in a global research environment. Supplemental Table 1 clarifies the terms used from computer science.

Methods
Five of the included articles were randomly selected by DH. CL and DH reviewed the reporting completeness of the 5 articles independently according to the STROBE-nut reporting guidelines, and highlighted the placement of the article content described according to the STROBE-nut items. Disagreements were resolved through discussion until a consensus was reached. The remaining articles were assessed by DH and uncertainties were resolved based on the consensus among DH and CL.

Identification of components to enable automated tracking
We reviewed components required to enable automated tracking of nutritional research in a graph database. The components include existing concepts for information management, research infrastructures, programming languages, software, publisher interfaces, visualization tools, etc.

Automated processing of annotations
A Python (32) module was developed to process the structure and STROBE-nut annotations of the articles in XML format (https://github.com/cyang0128/Nutritional-epidemiologicontologies/tree/master/strobenut). Three types of functions are included in the module: 1) "Input.py": provides 2 functions to extract and store article metadata through an Application Programming Interface (API; i.e., a virtual interface used to retrieve the requested data from web servers) or in local XML files respectively; 2) "Annotate.py": annotates the reporting completeness of articles stored in a Neo4j graph database and reported according to STROBE-nut reporting guidelines; and 3) "Figure.py": shows the statistics of reporting completeness of article(s) and the reporting frequency of different STROBE-nut items. All code was tested in Python (3.7.4) through Jupyter Notebook (6.0.1) (33).

Development of a graph database for semantic information
We conducted 3 consecutive steps to illustrate the development and application of a graph database for the management of reporting completeness of articles according to the STROBE-nut reporting guidelines.

Step 1: automated information extraction and storage in the graph database.
The metadata of articles such as "title," "abstract," "Digital Object Identifier" (DOI), "keywords," and content structure (e.g., Background, Methods, Results, Discussion, etc.) were retrieved and extracted by using Python. For the articles (n = 10) published by "Springer Nature," the metadata was obtained from the "Springer Nature API Portal" (34) (collection: "openaccess," result format: "jats"). For the remaining articles (n = 5), the corresponding XML files were downloaded and accessed through a local path. The following functions defined in 3 existing Python modules were used to extract the metadata: 1) "lxml" (35) and "re" (36) to locate article metadata in XML and 2) "py2neo" (37) to store the extracted data in a Neo4j knowledge graph database. Several classes and relations were used to explain the extracted metadata as well as their relations:

Classes (i.e., categories of terms):
1) "Article" describes the type of scientific article. The 15 selected articles were arranged under this class, and their titles and DOIs were used as "property" to identify the articles uniquely. 2) "Abstract" describes the summary of a scientific article. 3) "Keyword/Keywords set" describes a word/a set of words about the scientific area of a scientific article. 4) "Section/Subsection" describes different parts of a scientific article such as "Methods," "Results," etc. as well as their subsections (e.g., statistical method).

Relations:
1) "section/subsection" describes the relation between a text and the resource of the text physically. Therefore, "section" was used to describe the relation between an article and the article's sections (e.g., Methods), and "subsection" was used to describe the relation between sections of the article (e.g., Methods) and its subsections (e.g., participants recruitment). 2) "hasAbstract" describes the relation between an article and its abstract. 3) "hasKeyword" describes the relation between an article and its set of keywords.
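The extraction step above can be sketched in a few lines of Python. This is a minimal illustration, not the module used in the study: it relies on the standard-library "xml.etree.ElementTree" instead of "lxml," keeps the nodes and relations in plain Python containers instead of writing them to Neo4j through "py2neo," and parses a hypothetical, heavily simplified JATS-like fragment. The identifier scheme (DOI plus "#" and a suffix), the function name "extract_article," and the sample DOI are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified JATS-like fragment; not one of the 15 articles.
SAMPLE_XML = """
<article>
  <front><article-meta>
    <article-id pub-id-type="doi">10.0000/example</article-id>
    <title-group><article-title>Example diet study</article-title></title-group>
    <abstract><p>Short summary.</p></abstract>
    <kwd-group><kwd>diet</kwd><kwd>cohort</kwd></kwd-group>
  </article-meta></front>
  <body>
    <sec><title>Methods</title>
      <sec><title>Statistical methods</title><p>Text.</p></sec>
    </sec>
    <sec><title>Results</title><p>Text.</p></sec>
  </body>
</article>
"""


def extract_article(xml_text):
    """Return (nodes, relations) for one article's content structure."""
    root = ET.fromstring(xml_text.strip())
    doi = root.findtext(".//article-id[@pub-id-type='doi']")
    title = root.findtext(".//article-title")
    # The DOI and title act as identifying properties of the "Article" node.
    nodes = {doi: {"class": "Article", "doi": doi, "title": title}}
    relations = []  # (source id, relation name, target id)

    def add(node_id, cls, source, relation, **props):
        nodes[node_id] = {"class": cls, **props}
        relations.append((source, relation, node_id))

    abstract = root.find(".//abstract")
    if abstract is not None:
        text = " ".join("".join(abstract.itertext()).split())
        add(doi + "#abstract", "Abstract", doi, "hasAbstract", text=text)

    keywords = [k.text.strip() for k in root.iter("kwd") if k.text]
    if keywords:
        add(doi + "#keywords", "Keywords set", doi, "hasKeyword", words=keywords)

    body = root.find("body")
    for sec in (body.findall("sec") if body is not None else []):
        sec_id = doi + "#" + sec.findtext("title", default="untitled")
        add(sec_id, "Section", doi, "section", name=sec.findtext("title"))
        for sub in sec.findall("sec"):
            sub_id = sec_id + "/" + sub.findtext("title", default="untitled")
            add(sub_id, "Subsection", sec_id, "subsection", name=sub.findtext("title"))
    return nodes, relations


nodes, relations = extract_article(SAMPLE_XML)
for source, relation, target in relations:
    print(f"{source} -[{relation}]-> {target}")
```

Each printed triple corresponds to one relation of the kind listed above; in a real pipeline, the same triples would be written to the graph database rather than printed.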
Step 2: annotation of reporting guidelines in the graph database.
To archive information regarding the reporting completeness of the selected articles, Python functions defined in "py2neo" (37) and queries written in "Cypher" (38) were used to construct the new function (i.e., "Annotate.py") in Python. The function converts the annotation on the article to the virtual annotation in the graph database. Several classes and relations were selected from existing ontologies. An ontology represents a set of categories and terms that are interlinked by relations and annotated by properties. In the present study, the ontology is used to describe the annotation done according to STROBE-nut reporting guidelines:

Classes (categories):
1) "STROBE-nut article" indicates that an article is reported according to the STROBE-nut reporting guidelines. Therefore, the 15 selected articles were arranged under this class as well, and their DOIs were used as a "property" to identify the articles uniquely. 2) "STROBE-nut section/subsection" describes different parts of a scientific article such as "Methods," "Results," etc. as well as the subsections proposed in the STROBE-nut reporting guidelines. 3) "STROBE-nut item" describes the 24 items proposed in the STROBE-nut reporting guidelines to guide transparent reporting (39).

Relations:
1) "STROBE-nut" is chosen to link the class "Article" and "STROBE-nut article" defined in the graph database. Semantically, it clarifies that the article is reported according to the set of the STROBE-nut items in the STROBE-nut reporting guidelines. 2) "section/subsection" is used to describe the relations between "STROBE-nut article," "STROBE-nut section," and "STROBE-nut subsection."

Step 3: automated calculation of reporting characteristics.
Statistics of reporting completeness for articles were generated from the graph database. The following functions were included: 1) visualization of the reporting completeness of a single article; 2) representation of the overall reporting completeness for the meta-analysis of the articles, and 3) representation of the reporting statistics of the STROBE-nut items. Using the STROBE-nut reporting guidelines, the graph database of the 15 articles was developed using the same classes (categories) and relations, which enables automatic calculation of these statistics. Python functions (i.e., functions defined in the "Figure.py") were executed for the 15 articles to generate the figures. We used an example for each function to illustrate the approach. First, 4 bar charts are used to show the articles with the highest/lowest reporting completeness, and the frequently/rarely reported STROBE-nut items, respectively. Second, a radar chart shows the reporting completeness of a single article. The reporting completeness of information in 5 sections proposed in the STROBE-nut reporting guidelines (i.e., Title/Abstract, Methods, Results, Discussion, Other information) is presented in percentages, and the overall reporting completeness among the 5 sections is also shown on the top of the radar chart. Third, a pie chart is used to show an example of reporting completeness classification.
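As an illustration of the percentages shown in the radar chart, the following minimal Python sketch computes reporting completeness per section and overall. It is not the code in "Figure.py": the item labels ("nut-1," etc.), their allocation to sections, and the function name "completeness" are hypothetical, and the overall value is assumed here to be the share of all items that are reported, rather than the mean of the section percentages.

```python
def completeness(items_by_section, reported):
    """Return (percentage per section, overall percentage) of reported items."""
    per_section = {}
    n_items = n_reported = 0
    for section, items in items_by_section.items():
        hits = sum(1 for item in items if item in reported)
        per_section[section] = 100.0 * hits / len(items)
        n_items += len(items)
        n_reported += hits
    return per_section, 100.0 * n_reported / n_items


# Hypothetical allocation of items to the 5 sections (not the real STROBE-nut list).
items_by_section = {
    "Title/Abstract": ["nut-1"],
    "Methods": ["nut-5", "nut-6", "nut-7", "nut-8"],
    "Results": ["nut-13", "nut-14"],
    "Discussion": ["nut-19", "nut-20"],
    "Other information": ["nut-22"],
}
# Hypothetical set of items found in one annotated article.
reported = {"nut-1", "nut-5", "nut-13", "nut-22"}

per_section, overall = completeness(items_by_section, reported)
for section, pct in per_section.items():
    print(f"{section}: {pct:.0f}%")
print(f"Overall: {overall:.0f}%")
```

The per-section values would feed the axes of a radar chart, and the overall value would be the figure displayed on top of it.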

A graph database for nutrition research knowledge
A graph database is a database that uses relations to link terms to archive and represent semantic information (e.g., phrases or sentences) (40,41). The Semantic Web, as an extension of the World Wide Web, is being developed as the most comprehensive graph database to handle global data. Unique identifiers are given to identify terms as well as their relations. By tracking the identifiers, semantic information in a graph database can be accessed, harmonized, integrated, and visualized (42). For instance, the DOI was introduced in 2000 as a persistent and unique identifier of a scientific publication (43). The DOI of a research article is an indirect link to the article and also provides information regarding authors, journal, etc. through an API (40).
Neo4j has been the most commonly used graph database for several years (44). Moreover, Neo4j is compatible with the Python programming environment through the module "py2neo," which can be integrated in other Python modules (e.g., plotting library, web framework, rule engine, etc.) for the visualization, statistics calculation, etc. of a graph database. Figure 1 shows the components that can be used to develop a graph database for nutrition articles, i.e., an underpinning theory, software, research infrastructures, and publisher interfaces. A more detailed description of the technology used is included as Supplemental Table 4.
To convert the semantic information of nutrition articles into code, the semantic information needs to be reorganized in a machine-readable format. The DIKW (Data, Information, Knowledge, Wisdom) pyramid is a classical model for information classification in computer science (45). For nutritional epidemiology, the DIKW pyramid is a suitable approach to manage information (41), and provides the theoretical basis to classify the semantic information processed and stored in a graph database. In this study, the semantic information of the articles is classified as "Information" in the DIKW pyramid in Figure 1.
To manage the semantic information of nutrition research in a graph database, reporting guidelines provide a template to define relations between the terms. EQUATOR (Enhancing the QUAlity and Transparency Of health Research) (46) and FAIRsharing (47) are recommended as repositories to retrieve relevant reporting guidelines. Being the most comprehensive ontology repository of medical science, BioPortal (48) is recommended for retrieving the corresponding ontologies. For this study, Ontology for Nutritional Epidemiology (ONE) (39) was used.
To retrieve the semantic information of scientific articles, various publishers [e.g., Multidisciplinary Digital Publishing Institute (MDPI) (49), Frontiers (50), and Public Library of Science (PLOS) (51)] provide user interfaces (UI). The UIs enable interactions between a human user and a computer system to retrieve scientific articles in machine-readable formats. In addition, Springer Nature provides an API to facilitate data extraction (34). Python (32) was selected to manage the semantic information in the graph database.

Graph database visualization
A set of linked data is visualized in the Neo4j browser (37). We describe the visualization for only 1 article (17) as the graphs are similar for all scientific articles in nutritional epidemiology. Figure 2 shows the content structure of the scientific article stored in the graph database. The metadata of this article is visualized using different colors and connected by human- and machine-readable relations. The article consists of 5 sections, 11 subsections (6 subsections under "Methods" and 5 subsections under "Discussion"), 1 abstract, and a set of 3 keywords. Different relations were used to connect all the components extracted from its XML file. Figure 3 shows the annotation of the article's reporting completeness according to the STROBE-nut reporting guidelines. Out of a total of 24 STROBE-nut items, 18 items are reported in the article, as indicated with brown circles. The sections and subsections proposed in the STROBE-nut reporting guidelines are visualized using red and blue circles, respectively. A class labeled as "Harmonized article" was used to gather all the STROBE-nut annotation components. In addition, the relation "STROBE-nut" is used to connect the STROBE-nut annotations of the article text and the content structure of the article. Through the annotation, the proposed sections of the manuscript in which the STROBE-nut items were reported (i.e., title/abstract, methods, result, discussion in the present article) are defined.
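The annotation structure described above could be expressed as Cypher statements along the following lines. This is a hedged sketch rather than the content of "Annotate.py": the function only builds statement and parameter pairs and does not connect to a database. The labels "Article," "Harmonized article," "STROBE-nut section," and "STROBE-nut item" and the relations "STROBE-nut" and "section" follow the text; the relation "reports," the property names, the item labels, and the DOI are assumptions made for the example.

```python
def annotation_statements(doi, reported_items):
    """Build (cypher, parameters) pairs for one article's STROBE-nut annotation.

    reported_items maps a STROBE-nut section name to the item labels reported in it.
    """
    # Link the article's content structure to its harmonized annotation node.
    statements = [(
        "MATCH (a:Article {doi: $doi}) "
        "MERGE (h:`Harmonized article` {doi: $doi}) "
        "MERGE (a)-[:`STROBE-nut`]->(h)",
        {"doi": doi},
    )]
    for section, items in reported_items.items():
        for item in items:
            # "reports" is a hypothetical relation name chosen for this sketch.
            statements.append((
                "MATCH (h:`Harmonized article` {doi: $doi}) "
                "MERGE (s:`STROBE-nut section` {doi: $doi, name: $section}) "
                "MERGE (h)-[:section]->(s) "
                "MERGE (i:`STROBE-nut item` {doi: $doi, label: $item}) "
                "MERGE (s)-[:reports]->(i)",
                {"doi": doi, "section": section, "item": item},
            ))
    return statements


# Hypothetical DOI and item labels.
statements = annotation_statements(
    "10.0000/example",
    {"Methods": ["nut-5", "nut-6"], "Results": ["nut-13"]},
)
for cypher, params in statements:
    print(params, "->", cypher)
```

Using MERGE rather than CREATE keeps the statements idempotent, so re-running the annotation for the same article would not duplicate nodes or relations. Each pair could then be submitted to Neo4j through a driver of choice.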

Automated calculation of reporting completeness statistics
In Figure 4, 6 different Python functions are presented to summarize the reporting completeness of the articles (n = 15) stored in the Neo4j database.   (24). The STROBE-nut items in the section "Title/Abstract" and "Other sections" are fully reported. About 45% of the STROBE-nut items are reported in "Results," and <40% of the STROBE-nut items are reported in "Methods." There are no STROBE-nut items reported in the section "Discussion." Overall, ∼40% of the STROBE-nut items are reported in the article.

Discussion
Here, we described an approach to apply the STROBE-nut reporting guidelines as digital annotations for research articles in nutritional epidemiology. We demonstrated a feasible and semiautomated method to annotate, retrieve, archive, link, and process essential study characteristics and the reporting completeness of nutrition research. To the best of our knowledge, this is the first attempt to apply the FAIR principles to reporting guidelines in biomedical sciences.
The present work illustrates the potential to use information on article content as a metric to assess articles. Previous attention has been directed to article-level instead of journal-level indicators to assess the added value of research (52,53). These metrics, known as altmetrics, however, currently mainly deal with scholarly citations and references in social media (54,55). The present approach paves the way for an appreciation of article content regarding its reporting completeness. For instance, a radar chart (Figure 4.3) could visualize the metric for a single article, and indicate the reporting completeness of different sections (e.g., title, introduction, methods, results, etc.) of an article. However, for literature reviews, it might be more appropriate to use summary statistics (Figure 4) to describe the overall reporting completeness of selected articles.
The present study used annotations done by experts involved in the development of STROBE-nut (DH and CL). For application at scale however, authors of articles are best placed to provide the annotations. Increasingly, nutrition journals encourage authors to use article-based reporting guidelines such as STROBE-nut to provide information on the reporting completeness of their scientific articles (12). To support authors, publishers could provide a user-friendly UI (e.g., a website) to collect data on reporting completeness during the submission process. The utility of reporting guidelines to date is mainly geared at improving the completeness of the manuscript at the time of submission and facilitation of peer review. Converting information on reporting completeness to linked content of articles enables new applications that add value for authors, journals, and search engines to classify and retrieve information. The approach presented is the first use case in this regard and could unlock new applications to increase the use of and adherence to reporting guidelines.
For application at scale, machine learning approaches can automate the annotation approach and generate annotations directly from the submitted manuscript text. A fully automated approach however, may affect the accuracy of the annotations and metrics. In a first instance, machine learning approaches should be used to suggest relevant terms to facilitate the annotation process during submission. Feedback from authors on the suggested terms will be collected to verify the quality of the annotation process. Gradually, an iterative process between the machine learning suggestions and author feedback will improve the accuracy of the automatic annotation process.
The current work contributes to other initiatives. For example, the "Springer Nature SciGraph" (34) is a Linked Open Data (i.e., linked and freely available online data) platform that aggregates data from publications of Springer Nature. "Springer Nature SciGraph" links data from reliable sources and presents how information is interconnected in the articles. Moreover, Elsevier provides an API, "the Elsevier Developer," for developers to gain access to linked data of journals, books, data, abstracts, etc. published by Elsevier (56). Here, the use of linked data for study reporting completeness could be instrumental to manage the content of scientific publications thereby improving the findability and accessibility of research findings.

FIGURE 2 Visualization of 1 article's metadata (17) in the graph database. STROBE-nut, STROBE extension for nutritional epidemiology.
To promote the use of a graph database, the encoding of XML-based scientific articles according to reporting guidelines is an interesting prospect. XML is a markup language with set rules and vocabulary identifiers. It produces documents that are both human-readable and machine-readable. When the STROBE-nut ontology is used to annotate scientific articles in XML, the annotated nutritional study content enables knowledge queries, integration, and inferences. In our study, essential study characteristics of the articles were converted to the machine-readable STROBE-nut identifiers. Users can use these identifiers as well as their relations to find, filter, and integrate the content of the articles in a virtual research infrastructure. Moreover, rules can be developed to reason on the annotated study characteristics. For instance, when using the reported STROBE-nut items as inclusion and exclusion criteria for a systematic review, customized reasoning rules can be set to automate the selection of articles.
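A selection rule of this kind can be sketched in Python as a simple set comparison. The sketch below is only an illustration under stated assumptions: the DOIs, the item labels, the function name "select_articles," and the optional threshold "min_reported" are hypothetical, and the annotations are held in a dictionary rather than queried from the graph database.

```python
def select_articles(reported_by_doi, required_items, min_reported=0):
    """Return the DOIs of articles that report every required item
    and at least min_reported items in total."""
    selected = []
    for doi, items in reported_by_doi.items():
        if required_items <= items and len(items) >= min_reported:
            selected.append(doi)
    return sorted(selected)


# Hypothetical annotations: DOI -> set of reported STROBE-nut item labels.
corpus = {
    "10.0000/a": {"nut-5", "nut-6", "nut-13"},
    "10.0000/b": {"nut-5"},
    "10.0000/c": {"nut-6", "nut-13"},
}

# Inclusion rule: the article must report item "nut-5" and at least 2 items overall.
included = select_articles(corpus, {"nut-5"}, min_reported=2)
print(included)
```

In a graph database, the same rule would typically be written as a query over the item nodes linked to each article, so that the selection can be rerun whenever new annotated articles are added.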
According to the quality rating of linked data introduced by Berners-Lee (57), the graph database presented here obtains a "4 star" rating: the database consists of human-readable information of the used vocabulary; is machine-readable and enables semantic reasoning; has been linked to existing linked data sources (e.g., DOIs, etc.); and finally, the metadata about the vocabulary (i.e., definitions, uses, authors, references of reporting guidelines, etc.) are available. To achieve the full "5 star" status for linked data, the developed graph database should be connected to linked data networks that are in use. This would require combined efforts from different stakeholders involved in the development of research guidelines, and the production and publication of research findings. Research groups such as the EQUATOR network (46) and FAIRsharing (58) should promote the submission of reporting guidelines as linked data, and make the linked data available online for people to see, understand, adopt, and use. DOI Registration Agencies could register the reporting completeness information as linked data of DOI and enable the API for information query and linked data reuse.
As an application for nutrition research, we used a sample of articles that reported using STROBE-nut. The approach presented can relatively easily be expanded to other reporting guidelines such as PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) (  Experiments) (62). This would, however, require the development of relevant ontologies and substantial involvement of other guideline developers, which was beyond the scope of the present study.
To apply the present approach at scale, a culture of linked data in nutrition research needs to be fostered. Eventually, a new workforce of researchers will be required to apply information technology for nutrition research. Basic knowledge regarding the use of ontologies, open science, and FAIR data needs to be integrated in the curriculum of students and researchers.