The application of artificial intelligence and data integration in COVID-19 studies: a scoping review

Abstract Objective To summarize how artificial intelligence (AI) is being applied in COVID-19 research and determine whether these AI applications integrated heterogenous data from different sources for modeling. Materials and Methods We searched 2 major COVID-19 literature databases, the National Institutes of Health’s LitCovid and the World Health Organization’s COVID-19 database on March 9, 2021. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline, 2 reviewers independently reviewed all the articles in 2 rounds of screening. Results In the 794 studies included in the final qualitative analysis, we identified 7 key COVID-19 research areas in which AI was applied, including disease forecasting, medical imaging-based diagnosis and prognosis, early detection and prognosis (non-imaging), drug repurposing and early drug discovery, social media data analysis, genomic, transcriptomic, and proteomic data analysis, and other COVID-19 research topics. We also found that there was a lack of heterogenous data integration in these AI applications. Discussion Risk factors relevant to COVID-19 outcomes exist in heterogeneous data sources, including electronic health records, surveillance systems, sociodemographic datasets, and many more. However, most AI applications in COVID-19 research adopted a single-sourced approach that could omit important risk factors and thus lead to biased algorithms. Integrating heterogeneous data for modeling will help realize the full potential of AI algorithms, improve precision, and reduce bias. Conclusion There is a lack of data integration in the AI applications in COVID-19 research and a need for a multilevel AI framework that supports the analysis of heterogeneous data from different sources.


INTRODUCTION
In just a few months, the 2019 novel coronavirus disease (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has rapidly spread around the globe, and at the time of this writing, there are over 100 million confirmed COVID-19 cases and a few million confirmed deaths from COVID-19 worldwide. 1 The COVID-19 pandemic is now the second deadliest pandemic in over 100 years, behind only the 1918 influenza pandemic (ie, Spanish Flu). 2 While the COVID-19 pandemic is still raging, and the number of cases is growing exponentially, the scientific communities around the world have reacted promptly by directing efforts and resources to research studies on the etiology, transmission, detection, treatment, and prevention and control of COVID-19. In about a year, over 100 000 research articles on COVID-19-related topics have been published according to PubMed. 3
Recent advances in artificial intelligence (AI) have provided novel methods and tools for combating global pandemics, such as COVID-19. In classic computer science textbooks, AI is broadly defined as the study of intelligent agents, machines or devices that can imitate human cognitive functions to learn the environment and take actions. 4 The learning process is often implemented through mathematical or statistical models in computer programs. Machine learning, of which deep learning is a subset, is a branch of AI that trains algorithms that allow computer programs to automatically (ie, without explicit programming) improve through data. 5 In the fields of public health and medicine, AI techniques (especially machine learning and, more recently, deep learning methods) have been widely used for disease surveillance, health risks and outcomes prediction, medical diagnostics and therapeutics, clinical decision-making, and many more. [6][7][8]
With surveillance tools, patient reporting systems, and clinical studies emerging quickly, large amounts of novel data have been rapidly accumulated during the COVID-19 pandemic. There is growing interest in leveraging these data to develop AI solutions for COVID-19 challenges. However, developing AI models in the era of precision health is not a trivial task. Precision health adopts a unified approach to understanding the full range of determinants of health for health promotion, prevention, diagnosis, and treatment. 9, 10 The vision of precision health can only be realized through the integration and examination of a comprehensive list of determinants of health that include genetic, biological, environmental, as well as social and behavioral factors. However, these determinants of health exist in various data sources that are heterogeneous in syntax (eg, file formats), schema (eg, data models and structures), and semantics (eg, meanings or interpretations of the variables). One of the first and most important challenges in building precision health AI models is integrating relevant data that contain determinants of health from the heterogeneous sources.
In this study, we conducted a scoping review of AI applications in COVID-19 research with a focus on heterogeneous data integration. Our goal was to summarize the COVID-19 research areas in which AI is being applied, the AI models being used in these research applications, and the data sources being used to build the AI models. We were particularly interested in examining whether these AI applications integrated heterogenous data from different sources for building the models and treated missing data in the variables of interest. Although a few published reviews have summarized the applications of AI or machine learning methods in COVID-19 research, [11][12][13][14][15] none of them examined data integration, and many focused on a specific area of COVID-19 research (eg, medical imaging 15 ). Note that we focused on the use of AI methods for data analysis and excluded other AI fields, such as robotics.

Search strategy
We searched 2 major COVID-19 literature databases, the National Institutes of Health (NIH) LitCovid (part of PubMed) 3 and the World Health Organization (WHO) COVID-19 database, 16 for articles published through March 9, 2021. LitCovid is an open-resource literature hub developed by the NIH for tracking up-to-date scientific information about COVID-19. It provides central access to all COVID-19-related articles in PubMed. 3 The WHO COVID-19 database contains global literature on scientific findings and knowledge about COVID-19 gathered by the WHO. 16 Both databases are updated daily with newly published articles. The following query and keywords were used to search the databases: "artificial Intelligence" or "machine learning" or "supervised learning" or "unsupervised learning" or "deep learning" or "neural networks" or "natural language processing."
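As a rough illustration, the OR-combined keyword query above behaves like a case-insensitive substring filter over titles and abstracts. The sketch below is not the search interface of LitCovid or the WHO database; the function name and the sample titles are hypothetical, and only the keyword list comes from the text.

```python
# Keywords taken from the search query quoted in the text.
KEYWORDS = [
    "artificial intelligence",
    "machine learning",
    "supervised learning",
    "unsupervised learning",
    "deep learning",
    "neural networks",
    "natural language processing",
]


def matches_query(text: str) -> bool:
    """Return True if any keyword occurs in the text (case-insensitive OR)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


# Hypothetical titles, used only to exercise the filter.
sample_titles = [
    "A Deep Learning model for COVID-19 detection on chest CT",
    "AI for COVID-19 triage in emergency departments",
    "Vaccine hesitancy during the pandemic: a survey",
]
retrieved = [title for title in sample_titles if matches_query(title)]
```

Note that a title using only the abbreviation "AI" is not retrieved by such a filter, because the abbreviation is not in the keyword list.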

Literature screening
Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline, 17 we screened the articles retrieved from the databases in 2 rounds. First, we screened the titles and abstracts of the identified articles and excluded those that: (1) did not use any AI methods for data analysis, (2) were unrelated to COVID-19, (3) were reviews, editorials, opinions, letters to the editor, or case reports, or (4) were not written in English. Second, we screened the full texts of the remaining articles to further exclude articles that met our exclusion criteria. Two reviewers (YZ and TL) independently reviewed all the articles in the 2 rounds of screening. Any conflicts between the 2 reviewers were reviewed and resolved by a third reviewer (YG). We extracted and summarized COVID-19- and AI-related information from the retained articles.

Summary
We summarized our review procedure in a PRISMA flow diagram in Figure 1. We identified 1311 and 1218 studies in the LitCovid and WHO COVID-19 databases, respectively. After removing duplicated studies, we included 1338 studies in the first round of screening. In the first round of screening of titles and abstracts, 492 studies were excluded according to our exclusion criteria, while 846 studies were included in the full-text review. In the second round of screening, another 52 studies were excluded based on full-text review and eventually, 794 studies were included in the final qualitative analysis.
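The screening counts above can be checked with simple arithmetic. The sketch below uses only the numbers stated in the text; the number of duplicates removed is not reported and is derived here as an assumption from the two database totals.

```python
# Counts as reported in the text.
identified = {"LitCovid": 1311, "WHO COVID-19 database": 1218}
after_deduplication = 1338
excluded_title_abstract = 492
excluded_full_text = 52

# Derived value (not stated in the text): records dropped as duplicates.
duplicates_removed = sum(identified.values()) - after_deduplication

# Each screening round removes its exclusions from the remaining pool.
full_text_reviewed = after_deduplication - excluded_title_abstract
included = full_text_reviewed - excluded_full_text

print(duplicates_removed, full_text_reviewed, included)
```

The derived totals for the full-text review and the final qualitative analysis agree with the reported 846 and 794 studies.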
The AI applications covered in these 794 studies can be categorized into the following areas of COVID-19 research: Disease forecasting (n = 161); Medical imaging-based diagnosis and prognosis (n = 322); Early detection and prognosis (non-imaging) (n = 152); Drug repurposing and early drug discovery (n = 53); Social media data analysis (n = 44); Genomic, transcriptomic, and proteomic data analysis (n = 24); and Other COVID-19 research topics (survey studies, literature mining, surveillance, clinical trials, miscellaneous topics) (n = 38). We listed the full citations of all 794 studies by research area in the Supplementary Table S1. In the following sections, we summarized what and how AI techniques were applied in these areas. In particular, we determined whether the studies integrated heterogeneous data to expand the list of inputs (or predictors) for building the AI models. In line with Lenzerini 2002, 18 we defined data integration as the action of combining data that are heterogeneous in syntax, schema, and semantics and extracting predictors from these data for modeling. The total number of studies and the number of studies with data integration in each research area were summarized in Figure 2.
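To make the definition of data integration concrete, the following minimal sketch combines two sources that differ in syntax (CSV vs JSON), schema (flat vs nested records), and semantics (different key names and units) into one set of predictors. All data values, field names, and function names are hypothetical and chosen only for illustration; they do not come from any reviewed study.

```python
import csv
import io
import json

# Hypothetical source 1: case counts as flat CSV, keyed by a "fips" column.
CASES_CSV = """fips,cases
12001,340
12003,25
"""

# Hypothetical source 2: sociodemographics as nested JSON, keyed by
# "county_id", with median income reported in thousands of dollars.
DEMOGRAPHICS_JSON = """
[
  {"county_id": "12001", "population": {"total": 200000}, "median_income_k": 49},
  {"county_id": "12003", "population": {"total": 50000}, "median_income_k": 62}
]
"""


def load_cases(text):
    """Parse the CSV source into {county: {"cases": int}}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["fips"]: {"cases": int(row["cases"])} for row in reader}


def load_demographics(text):
    """Parse the JSON source, flattening the schema and converting units."""
    harmonized = {}
    for item in json.loads(text):
        harmonized[item["county_id"]] = {
            "population": item["population"]["total"],
            "median_income_usd": item["median_income_k"] * 1000,
        }
    return harmonized


def integrate(cases, demographics):
    """Join both sources on the shared county key and derive a predictor."""
    merged = {}
    for county in cases.keys() & demographics.keys():
        row = {**cases[county], **demographics[county]}
        row["cases_per_100k"] = 100_000 * row["cases"] / row["population"]
        merged[county] = row
    return merged


features = integrate(load_cases(CASES_CSV), load_demographics(DEMOGRAPHICS_JSON))
```

The join key and the unit conversion are where semantic heterogeneity is resolved; in practice this mapping step usually requires the most manual effort.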

Disease forecasting
A total of 161 studies described the use of AI for COVID-19 forecasting (Supplementary Table S1). In these studies, 106 predicted future COVID-19 incidence or mortality using historical data only, 43 predicted future or confirmed COVID-19 cases using potential risk factors as inputs, 8 characterized country-level differences in COVID-19 outcomes worldwide (clustering studies), and 4 predicted future demands for hospital resources or medical consumables.
The majority of the 106 studies on predicting future COVID-19 incidence or mortality used COVID-19 data from the Johns Hopkins University Center for Systems Science and Engineering, 19 or local health authorities. In these studies, the long short-term memory (LSTM), a class of recurrent neural networks (RNN), was the most commonly used deep learning model. Other popular models included other types of artificial neural networks (ANN); machine learning models, such as random forest, support vector machines (SVM), and gradient boosting machine (GBM); statistical time series models, such as the autoregressive integrated moving average (ARIMA) model; and epidemiological models, such as the Susceptible-Infectious-Recovered and Susceptible-Exposed-Infected-Removed models. None of the 106 studies integrated heterogeneous data for modeling since only historical COVID-19 data were used as inputs.
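For readers unfamiliar with the epidemiological models mentioned above, the following is a minimal discrete-time sketch of a Susceptible-Infectious-Recovered model. The parameter values are hypothetical and the one-day Euler step is a simplifying assumption; it is not the implementation used in any reviewed study.

```python
def simulate_sir(population, infected0, beta, gamma, days):
    """Simulate S, I, R compartments with a one-day Euler step.

    beta is the transmission rate per day and gamma the recovery rate per day.
    """
    s = float(population - infected0)
    i = float(infected0)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Hypothetical parameters: 1000 people, 1 initial case, beta/gamma = 3.
history = simulate_sir(population=1000, infected0=1, beta=0.3, gamma=0.1, days=60)
peak_day = max(range(len(history)), key=lambda day: history[day][1])
```

Like the models in the 106 forecasting studies, this sketch uses only historical case dynamics as input and no external risk factors.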
In the 43 studies on COVID-19 risk factors, 27 examined environmental exposures, while the remaining 16 examined a range of other risk factors, such as population characteristics, socioeconomic status, or other health-related factors. Most of these studies used machine learning models, among which random forest and GBM were the most popular algorithms. A small portion of these studies used ANN, among which the multilayer perceptron (MLP) was the most popular. Among these 43 studies, slightly over half (n = 24, 55.8%) integrated heterogeneous data on predictors for modeling (Table 1). Three of these studies imputed missing data: 2 used simple mean or median imputation, while the third study used the k-nearest neighbor (k-NN) method (Table 1).
All 8 clustering studies used unsupervised machine learning models, with the most popular model being k-means. These studies aimed to group and compare countries or regions based on COVID-19 incidence, risks, and preparedness or performance. Half of the studies (n = 4, 50.0%) integrated heterogeneous data for modeling (Table 1). One of the 4 studies imputed missing data with mean values (Table 1).
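Simple mean or median imputation, as mentioned above, replaces each missing value in a variable with the mean or median of the observed values. The sketch below is one minimal way to do this; the function name and the use of None to mark missing values are assumptions made for illustration.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Fill None entries with the mean or median of the observed entries."""
    observed = [value for value in values if value is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if value is None else value for value in values]


# Hypothetical predictor columns with one missing value each.
temperature = impute([1.0, None, 3.0], strategy="mean")
income = impute([1, None, 2, 100], strategy="median")
```

The median is less sensitive to outliers than the mean, which is visible in the second example, where a single large value would otherwise dominate the fill value.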
The 4 studies on future demands predicted the need for intensive care unit (ICU) beds or medical consumables (eg, face masks) using data on COVID-19 cases or on consumable sales or production. All 4 studies used ANN (eg, MLP) or RNN (eg, LSTM), with some studies also building machine learning models. None of the studies integrated heterogeneous data for modeling.

Medical imaging-based diagnosis and prognosis
A total of 322 studies described the use of AI for analyzing medical imaging data for COVID-19 diagnosis and prognosis (Supplementary Table S1). All studies analyzed either computed tomography or chest X-ray data, except for 5 studies that analyzed images of lung ultrasound [48][49][50][51] or skin lesions. 52 The most common sources of medical images were local hospitals or healthcare systems and image datasets published on public platforms, such as GitHub or Kaggle. In these imaging studies, roughly half used convolutional neural network (CNN)-based models. More than 90% of these studies predicted COVID-19 outcomes using medical imaging data alone. Only 29 out of the 322 studies (9.0%) considered data from heterogenous sources for AI modeling (Table 2). In addition to imaging data, these studies considered influences from demographics (eg, age, sex), clinical characteristics (eg, symptoms, lab results, disease history), and other human factors (eg, exposure history) on COVID-19 outcomes. Five of these studies imputed missing data using simple mean or median imputation (Table 2).

Early detection and prognosis (nonimaging)
A total of 152 studies described the use of AI for COVID-19 early detection (n = 52) and prognosis (n = 100) (Supplementary Table S1). The vast majority of the studies on COVID-19 early detection analyzed COVID-19 positivity (+ vs -, determined by the reverse transcription polymerase chain reaction test) as the study outcome using patient data from hospitals or healthcare systems. A wide range of AI models were used for prediction, although machine learning models (eg, random forest, GBM) were used more often than deep learning models. Furthermore, most studies used a single type of data for COVID-19 detection, such as lab test data (eg, blood cell counts or inflammatory biomarkers) or clinical symptoms. Only 8 out of the 47 studies (17.0%) integrated heterogenous data for modeling (Table 3). In addition to lab and symptom data, these studies considered data on comorbidity, medications, travel/contact history, etc.
The vast majority of the studies on COVID-19 prognosis examined hospitalization, ICU admission, mechanical ventilation requirements, and/or death in COVID-19 patients using data from hospitals or healthcare systems. Traditional machine learning models were preferred over deep learning models, with the most popular model being random forest. Only 21 out of the 92 studies (22.8%) integrated heterogenous data for modeling (Table 3). These heterogenous data included demographics, clinical data (eg, lab, disease and medication history, and symptoms), genetic sequencing data, exposure history, etc.
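The integration rates reported so far can be recomputed from the stated counts. The sketch below takes the numerators, denominators, and percentages directly from the text; the group labels are shorthand introduced here for illustration.

```python
# (studies with data integration, studies in the group, reported percentage)
reported = {
    "forecasting: risk factor studies": (24, 43, 55.8),
    "forecasting: clustering studies": (4, 8, 50.0),
    "medical imaging": (29, 322, 9.0),
    "early detection": (8, 47, 17.0),
    "prognosis": (21, 92, 22.8),
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


recomputed = {
    group: percent(integrated, total)
    for group, (integrated, total, _) in reported.items()
}

for group, value in recomputed.items():
    print(f"{group}: {value}%")
```

Each recomputed value matches the percentage given in the text to one decimal place.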
In the early detection and prognosis studies that integrated heterogenous data (Table 3), 8 studies imputed missing data. Most studies performed simple imputation based on mean, mode, or median values, while 2 studies performed multivariate imputation by chained equations, 100,104 and 1 study imputed missing values using bagging trees. 96

Drug repurposing and early drug discovery
A total of 53 studies described the use of AI for drug repurposing (36 studies) or early COVID-19 drug discovery (18 studies) (Supplementary Table S1). The majority of the studies focused on screening for candidate drugs in biomolecule or drug databases. Popular data sources included DrugBank (Food and Drug Administration [FDA]-approved and experimental drugs), 110 ChEMBL (bioactivity database for drug discovery), 111 PubChem (substance and compound databases), 112 ZINC (commercially available compounds for virtual screening), 113 and BindingDB (experimentally determined protein-ligand binding affinities). 114 Deep learning models (eg, CNN, RNN) were used more often than the machine learning models. Furthermore, 5 out of the 36 drug repurposing studies mined the literature for repurposable drugs. [115][116][117][118][119] All 5 studies used natural language processing (NLP)-based methods to mine scientific literature or other relevant data. For example, 1 study examined the description of over 1.2 million bioassays in the ChEMBL database to identify COVID-19-related bioassays. 115
The 18 studies on early drug discovery mainly focused on screening for potential biomolecules (eg, virtual ligand screening) in ligand or compound databases (eg, ChEMBL, PubChem, ZINC, BindingDB) that could target SARS-CoV-2 functional domains. Similarly, deep learning models were preferred over the machine learning models. None of the drug repurposing or early drug discovery studies integrated heterogeneous data for modeling.

Social media data analysis
A total of 44 studies described the use of AI for analyzing social media data (Supplementary Table S1). In these studies, Twitter was the single most popular data source, with 32 studies analyzing tweets from all over the world. The other 12 studies used data from Facebook, Reddit, YouTube, Weibo, etc. Most social media studies adopted a similar analytic approach: NLP methods and tools for text extraction and processing, followed by topic modeling and/or a sentiment analysis. The most common method for topic modeling was the latent Dirichlet allocation, whereas a range of machine learning models were used for sentiment analysis including SVM, Naïve Bayes, k-NN, random forest, etc. None of the social media studies integrated heterogeneous data for modeling.

Genomic, transcriptomic, and proteomic data analysis
A total of 24 studies described the use of AI for analyzing SARS-CoV-2 sequence data (eg, ribonucleic acid [RNA], small interfering RNA [siRNA], or protein sequences) (Supplementary Table S1). One common analysis goal of many of these studies was to determine the unique SARS-CoV-2 RNA or protein features that could potentially be targeted for disease detection and drug or vaccine design. Over half of these studies analyzed the SARS-CoV-2 genome sequences in the National Center for Biotechnology Information GenBank. 120 Other data sources included the Protein Data Bank, 121 National Genomics Data Center of China, 122 or self-generated sequence data. A wide variety of AI models were used in these studies,

Other COVID-19 research studies
Survey studies
A total of 14 survey studies used AI models for studying COVID-19-related topics in various populations around the world (Supplementary Table S1). The most common study outcomes were self-reported fear, stress, anxiety, and depression related to the pandemic. The majority of the studies used machine learning models, including random forest, XGBoost, SVM, and Naïve Bayes. Two of the studies, 123,124 which were based on the same online survey, collected text data using open-ended questions. These studies performed a sentiment analysis that involved sentiment score calculation and clustering using the k-means algorithm. None of the survey studies integrated heterogeneous data for modeling.

Literature mining
A total of 10 studies described the use of AI for mining COVID-19 literature (Supplementary Table S1). Literature mining studies on drug repurposing were summarized in a previous section. These 10 studies focused on summarizing topics and trends in COVID-19 research and identifying future research needs. All but 2 studies mined either PubMed or the COVID-19 Open Research Dataset. 125 Of the other 2 studies, 1 mined ClinicalTrials.gov to extract data on COVID-19-related trials, 126 while the other searched the Scopus database for a bibliometric analysis. 127 All of the studies involved NLP methods and tools (eg, word2vec, doc2vec). Some studies performed topic modeling and/or sentiment analysis. The only study that performed heterogeneous data integration was Reese et al (Table 4), 128 in which data from 13 heterogeneous knowledge sources (eg, scientific literature, COVID-19 cases, drug, genome sequences, chemicals, etc) were downloaded, transformed, and integrated to create the KG-COVID-19 knowledge graph.

Surveillance
A total of 6 studies described the use of AI for social distancing or syndromic surveillance (Supplementary Table S1). Three of these studies analyzed data from surveillance cameras for monitoring social distancing using well-known deep learning models for object detection, [131][132][133] including the single-shot detector, YOLO (you only look once), and/or the region-based CNN detector. Two other studies focused on analyzing Bluetooth signal strength data with linear and logistic models for contact tracing 134 or developing an NLP and deep learning-based pipeline for sentinel syndromic surveillance of COVID-19 using medical records. 135 The remaining study developed a Telegram Bot that could model individualized COVID-19 risk by integrating heterogenous data, including user responses and health/social data in medical records (Table 4). 129 This lone study involving heterogenous data used 3 machine learning models: random forest, SVM, and GBM.

Clinical trials
Two studies described the use of AI models in noninterventional clinical trials on COVID-19 patients (Supplementary Table S1). The 2 trials, namely the READY (NCT04390516) and IDENTIFY (NCT04423991), 136,137 were conducted by the same group of investigators based on the same machine learning algorithm (an XGBoost classifier) designed to predict mechanical ventilation and mortality within 24 hours upon hospital admission using inputs from clinical data. The READY trial evaluated the performance of the algorithm, 136 while the IDENTIFY trial identified a subpopulation of COVID-19 patients who had improved survival from taking hydroxychloroquine. 137 Neither study integrated heterogenous data for modeling.

Miscellaneous topics
A total of 6 studies did not fall under any of the previous research topics (Supplementary Table S1). In the lone study that integrated heterogeneous data for modeling, Abdalla et al integrated 43 sociodemographic variables from multiple sources (eg, Census Bureau, US Department of Agriculture, Centers for Disease Control and Prevention) and built elastic net models to examine how sociodemographics impacted county-level social distancing (Table 4). 130 Of the remaining studies, 1 used ANN to perform a drive-through mass vaccination simulation, 138 while the other 4 used NLP methods and tools on various research topics, including cross-lingual clinical deidentification in electronic health records (EHRs), 139 dream reports analysis, 140 drug safety analysis by mining the FDA adverse event system, 141 and COVID-19 clinical concept (signs and symptoms) identification and normalization in EHRs. 142

DISCUSSION
As governments, research communities, and healthcare industries are actively attempting to address the COVID-19 pandemic, we are tasked with identifying quick yet reliable solutions for screening, diagnosis, forecasting, surveillance, vaccine or drug development, and so on. At the same time, with large amounts of COVID-19-related data being collected in novel surveillance systems, AI methods have been widely employed in assisting medical experts and researchers in addressing COVID-19 challenges. In this article, we reviewed 1338 recent studies that applied AI methods or technologies in COVID-19 research. In the 794 studies included in our final qualitative analysis, we identified 7 key areas in which AI was applied. We also found that a wide range of machine learning and deep learning algorithms were used for modeling, although some were used more frequently than others depending on the area of research.
It is not at all surprising that AI methods have been used extensively in many areas of COVID-19 research. AI has been revolutionary for many analytics challenges in medicine and public health. For example, roughly 40% of the studies we reviewed (322 of 794) were studies of medical imaging analysis for assisting COVID-19 diagnosis. In fact, the use of AI in diagnostic medical imaging has been extensively explored for many diseases, such as cancer, 143 cardiovascular diseases, 144,145 lung diseases, 146 and brain diseases. 147 In these applications, AI has shown impressive sensitivity, similar to or better than expert interpretation, in identifying patterns and abnormalities in medical images that can aid diagnosis. Another major AI application in COVID-19 research is disease forecasting, with one-fifth of the studies we reviewed being in this category. Compared to popular statistical time series models such as the ARIMA, AI models such as the LSTM have been shown to have superior precision and accuracy when predicting time series data, 148 without making explicit assumptions (eg, stationarity) about the data. In several other areas of COVID-19 research, AI methods are the preferred data analysis tools because of their ability to handle large amounts of heterogenous data, including text data such as those in clinical narratives or on social media. For example, in drug discovery and genomic research, AI is ideal for analyzing massive amounts of sequence data (eg, proteomic or genomic data). 149,150
One limitation of the AI applications included in our scoping review is the lack of integration of data from heterogenous sources for modeling. In the era of precision health, it is critical to examine a comprehensive list of determinants of COVID-19 outcomes, including biological, clinical, social, behavioral, and environmental factors, that exist in various heterogeneous data sources. However, most studies we reviewed used data from a single source to perform the AI-driven tasks. For instance, over 90% of the imaging studies included in this review used data from radiological images only to build AI models for COVID-19 diagnosis. This single-sourced approach ignores other important risk factors such as clinical symptoms, exposure history, and lab test results, leading to algorithms with bias (eg, confounding bias) 151 and suboptimal performance. In fact, many of the medical imaging studies that integrated heterogenous data have shown that data integration led to AI models with better performance compared to models built with imaging data alone. [53][54][55]62,65,69,[76][77][78] Furthermore, although some data are difficult to obtain because of privacy issues or are simply unavailable, there is still a range of public data on risk factors that could be easily obtained for modeling. Many studies we reviewed leveraged such "free" data sources, such as the huge amounts of environmental data from the National Oceanic and Atmospheric Administration or the socioeconomic data from the Census Bureau. Overall, integrating heterogenous but relevant data for modeling will help realize the full potential of AI algorithms, and thus improve precision and reduce bias. Our review highlights the need for a multilevel AI framework that supports the analysis of heterogenous data from different sources.
Our scoping review has several limitations. First, our search strategy is not as comprehensive as that of a systematic review. For example, our keyword list did not include "AI." Articles that used the abbreviation "AI" without mentioning "artificial intelligence" were not included in this review. Although we do not expect a large number of articles to have been omitted, we do acknowledge this limitation in keywords. Second, we searched 2 major COVID-19 literature databases rather than the traditional databases used in systematic literature reviews. Relevant articles were often indexed in these 2 COVID-19 databases with a delay of a few days up to months. Third, we did not perform a risk of bias assessment given that this is a scoping review.

CONCLUSION
Huge amounts of novel data related to COVID-19 have emerged quickly during the pandemic. As a result, AI methods and technologies have been widely applied in efforts to overcome COVID-19 challenges. In this scoping review (date of literature search: March 9, 2021), we show that a broad range of AI algorithms are used for COVID-19 research, and these algorithms are primarily used in 7 major research areas. We also show that there is a lack of data integration in these AI applications and a need for a multilevel AI framework that supports the analysis of heterogenous data from different sources.

AUTHOR CONTRIBUTIONS
JB and YG conceived the project. YZ and TL performed the literature search and article screening, with YG being the third reviewer. YZ and TL performed the information extraction and created the initial tables. YG drafted the manuscript. MP, FW, HX, and JB assisted in writing. All authors read and approved the manuscript.

SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.