The benefit of augmenting open data with clinical data-warehouse EHR for forecasting SARS-CoV-2 hospitalizations in Bordeaux area, France

Abstract Objective The aim of this study was to develop an accurate regional forecast algorithm to predict the number of hospitalized patients and to assess the benefit of Electronic Health Records (EHRs) information for those predictions. Materials and Methods Aggregated data from public SARS-CoV-2 and weather databases and from the data warehouse of the Bordeaux Hospital were extracted from May 16, 2020 to January 17, 2022. The outcomes were the number of hospitalized patients in the Bordeaux Hospital at 7 and 14 days. We compared the performance of different data sources, feature engineering, and machine learning models. Results During the period of 88 weeks, 2561 hospitalizations due to COVID-19 were recorded at the Bordeaux Hospital. The model achieving the best performance was an elastic-net penalized linear regression using all available data, with a median relative error at 7 and 14 days of 0.136 [0.063; 0.223] and 0.198 [0.105; 0.302], respectively. EHRs from the hospital data warehouse improved the median relative error at 7 and 14 days by 10.9% and 19.8%, respectively. Graphical evaluation showed that the remaining forecast error was mainly due to delayed detection of slope shifts. Discussion The forecast model showed overall good performance at both 7 and 14 days, which was improved by the addition of data from the Bordeaux Hospital data warehouse. Conclusions The development of hospital data warehouses might help to get more specific and faster information than traditional surveillance systems, which in turn will help to improve epidemic forecasting at a larger and finer scale.


LAY SUMMARY
The objective of this work was to develop a forecast algorithm to predict the number of hospitalized patients at Bordeaux Hospital. In addition, we assessed the benefit of Electronic Health Records (EHRs) information for those predictions. To perform this task, we used data between May 16, 2020, and January 17, 2022, from the national database on the SARS-CoV-2 epidemic, a public weather database, and the data warehouse of the Bordeaux Hospital. The outcomes were the number of hospitalized patients in the Bordeaux Hospital at 7 and 14 days. During the period of 88 weeks, 2561 hospitalizations due to COVID-19 were recorded at the Bordeaux Hospital. The best model had an error of 13.6% at 7 days and 19.8% at 14 days. EHRs from the hospital data warehouse improved the performance by 10% at 7 days and 20% at 14 days. Graphical evaluation showed that the remaining forecast error was mainly due to delayed detection of slope shifts. The forecast model showed overall good performance, which was improved by the addition of EHR data. The development of hospital data warehouses might help to get more specific and faster information than traditional surveillance systems, which in turn will help to improve epidemic forecasting at a larger and finer scale.

BACKGROUND AND SIGNIFICANCE
Since the end of 2019, millions of SARS-CoV-2 cases have been reported worldwide. [1][2][3] This pandemic has had a major impact on health care, with an increase in hospitalizations leading to changes in the organization of care and to unprecedented population lockdowns to avoid health care system saturation. [4][5][6] The ability to anticipate the evolution of the epidemic at a local level is critical to manage the health care system.
To achieve this goal, several forecasting algorithms have been proposed. 7,8 None was fully satisfactory. Cramer et al 8 compared different approaches, including regression, compartmental, ensemble, and deep-learning models, to forecast the number of deaths related to COVID-19 in the United States. The best models included ensemble, deep-learning, and several compartmental methods. They all used epidemiological data and, depending on the model, mobility and demographic data. Although performance varied from one algorithm to another, the data used differed between the best models, without a clear trend indicating which data would provide the best performance. Because many factors change over time (eg, population behavior, government policies, vaccine coverage, virus strain), long-term forecast of COVID-19 is impossible. In France, several approaches to short-term epidemic forecast have been proposed relying on linear algebra, 9 ensemble methods 10 or neural networks. 11,12 All models aimed to forecast hospitalizations (among other things), except the one proposed by Carvalho et al, 11 which focused on cases, ICU, and deaths. The data used were hospitalizations, Reverse Transcriptase Polymerase Chain Reaction (RT-PCR), Intensive Care Units (ICU), weather, mobility, vaccination, variants of concern, and mask wearing policy data. The approach proposed by Mohimont et al, based on Convolutional Neural Networks, seems to achieve the best results, with a normalized root mean square deviation at 14 days of 3.2%, compared to a mean absolute percentage error of 20% for the approach proposed by Paireau et al at the national level. However, the periods of evaluation are different (May 2021 for Mohimont et al, 12 March to July 2021 for Paireau et al 10 ) and direct comparison is difficult. In addition, graphical evaluation showed that the models struggled to anticipate slope shifts, that is, changes in the dynamic of the epidemic toward an increase or a decrease of the number of cases.
Moreover, the performance improvement contributed by each data source was not formally evaluated. Finally, no model was satisfactory and valid enough to implement immediately.
Previous work focused mainly on national 9-12 or regional 10,12 forecast. Yet, finer granularities are needed to inform on local epidemic evolutions. Because hospitals are key actors during this pandemic and their saturation is a critical factor of sanitary policies, they are a relevant scale for local forecasting. Since November 2017, the Bordeaux University Hospital has developed a data warehouse based on the i2b2 architecture. 13 It facilitates the use of electronic health records and allows extracting detailed information about the epidemic, such as the emergency unit and ambulance service notes. We hypothesized that those local data should improve the forecast of the SARS-CoV-2 epidemic.

OBJECTIVE
The objective of this work was to develop an accurate regional forecast algorithm to predict the number of hospitalized patients and to assess the benefit of the Electronic Health Records (EHR) information to perform those predictions.

MATERIALS AND METHODS
Aggregated data from May 16, 2020, to January 17, 2022, regarding the French COVID-19 epidemic were included. Several data sources were used to improve forecasting.

Open data
Open data included both epidemiologic data from Santé Publique France and weather data from the National Oceanic and Atmospheric Administration (NOAA) Integrated Surface Database. 14,15 Both provide data aggregated by department and are updated daily.
Santé Publique France data included hospitalizations, number of RT-PCR, positive RT-PCR, proportion of positive RT-PCR, dominant variant, and number of first dose vaccinated. RT-PCR data were available by age and were grouped as 0-19, 20-59, and 60 and more years old categories. Variant identification data before February 18, 2021 were not available, and the majority variant before that date was assigned to wild type.
NOAA data, including temperature, wind speed, humidity, and dew point, were extracted and the PREDICT Index for COVID-19 Climate Transmissibility (Index PREDICT de Transmissivité Climatique de la COVID-19, IPTCC) was computed. 15 Missing weather data were imputed using a 2-step procedure: (1) the mean value of the adjacent department was imputed; (2) remaining missing values were imputed using last observation carried forward.
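The 2-step imputation can be sketched as follows. This is only an illustration under stated assumptions: the function name `impute_weather`, the `neighbors` adjacency map, and the choice to average all adjacent departments observed on the same day are hypothetical and not taken from the study's code.

```python
from statistics import mean

def impute_weather(series_by_dept, neighbors):
    """Impute missing daily weather values (None) in two steps.

    series_by_dept: dict department -> list of daily values, oldest first.
    neighbors: dict department -> list of adjacent departments (hypothetical map).
    """
    n_days = len(next(iter(series_by_dept.values())))
    imputed = {}
    # Step 1: use the mean of the adjacent departments observed on the same day.
    # Only originally observed neighbor values are used, never imputed ones.
    for dept, values in series_by_dept.items():
        filled = list(values)
        for t in range(n_days):
            if filled[t] is None:
                observed = [series_by_dept[nb][t]
                            for nb in neighbors.get(dept, [])
                            if series_by_dept[nb][t] is not None]
                if observed:
                    filled[t] = mean(observed)
        imputed[dept] = filled
    # Step 2: last observation carried forward for what is still missing.
    for filled in imputed.values():
        for t in range(1, n_days):
            if filled[t] is None:
                filled[t] = filled[t - 1]
    return imputed
```

A value missing on the very first day would stay missing in this sketch, since there is no earlier observation to carry forward.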

EHR data
The Bordeaux Hospital is a large institution comprising 3 hospital sites, taking care of nearly 250 000 hospitalized patients and 100 000 emergency consultations during 2020. 16 A data warehouse based on the i2b2 structure was built in 2017. 13 This star architecture is based on a central fact table where each row represents a diagnosis, a laboratory result, a procedure, a medical observation, etc. Each fact is related to other tables with information about the patient, the visit, the provider, or the type of fact. 17 This structure allows for quick data queries compared to the usual siloed organizations.
To perform those queries, ontology alignment has been performed on laboratory results, and ad hoc natural language processing tools have been developed, including ROMEDI (ie, a French drug terminology to extract drug information from text), IAMsystem (ie, a dictionary-based approach for named entity recognition), and SmartCRF (ie, a software to visualize and annotate EHR). [18][19][20] Several applied projects were performed using those tools, including automatic detection of surgical site infection and transfusion-associated circulatory overload. 21,22 Thanks to those previous experiences, the Bordeaux hospital data warehouse was used, during the pandemic, to describe the current state of the epidemic at the hospital level on a daily basis. Those data were then used in the forecast model, including: hospitalizations, hospital and ICU admissions and discharges, ambulance service notes, and emergency unit notes. Concepts related to COVID-19 were extracted from notes by dictionary-based approaches (eg, cough, dyspnea, COVID-19). Dictionaries were manually created based on manual chart review to identify terms used by practitioners. Then, the number and proportion of ambulance service calls or hospitalizations in emergency units mentioning concepts related to COVID-19 were extracted. Details of the features are available in Supplementary Table S1.
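As an illustration of how daily counts and proportions of notes mentioning COVID-19-related concepts can be derived, here is a minimal sketch. It uses a plain regular expression as a stand-in for the dictionary-based tools named above; the term list `COVID_TERMS` and the function names are hypothetical, and the sketch ignores negation, misspellings, and abbreviations.

```python
import re

# Hypothetical mini-dictionary; the study built its dictionaries from chart review.
COVID_TERMS = ["covid", "cough", "dyspnea", "fever"]
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, COVID_TERMS)) + r")\b", re.IGNORECASE
)

def mentions_covid_concept(note):
    """True if the note contains at least one dictionary term."""
    return PATTERN.search(note) is not None

def daily_counts(notes_by_day):
    """Map each day to (number, proportion) of notes mentioning a concept."""
    result = {}
    for day, notes in notes_by_day.items():
        hits = sum(mentions_covid_concept(n) for n in notes)
        proportion = hits / len(notes) if notes else 0.0
        result[day] = (hits, proportion)
    return result
```

Both the count and the proportion are kept because the daily volume of notes varies, so the same count can correspond to very different shares of activity.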
Due to different data acquisition mechanisms, there was a delay between the occurrence of events and the data acquisition. It was 1 day for EHR data, 5 days for department hospitalizations and RT-PCR, 4 days for weather, 2 days for variants, and 4 days for vaccination. For the training and evaluation of the model, the chosen date was the date of data availability, to mimic a real-time streaming forecast.
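Indexing features by availability date rather than event date can be sketched as below. The delays are those reported above; the source keys in `DELAY_DAYS` and the helper `latest_available` are hypothetical names introduced for this illustration.

```python
from datetime import date, timedelta

# Delays (in days) between an event and its availability, as reported in the text.
DELAY_DAYS = {
    "ehr": 1,
    "dept_hospitalizations": 5,
    "rt_pcr": 5,
    "weather": 4,
    "variants": 2,
    "vaccination": 4,
}

def latest_available(source, event_series, forecast_date):
    """Most recent (event_date, value) of a source already known on forecast_date.

    event_series: dict mapping event date -> value.
    Returns None if nothing from this source is available yet.
    """
    cutoff = forecast_date - timedelta(days=DELAY_DAYS[source])
    usable = {d: v for d, v in event_series.items() if d <= cutoff}
    if not usable:
        return None
    last = max(usable)
    return last, usable[last]
```

With this convention, a model run on a given day only sees EHR values up to the previous day and department RT-PCR values up to 5 days earlier, which is what a real-time forecast would have had at hand.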

Statistical and machine learning models
The outcomes were the number of hospitalized patients with SARS-CoV-2 infection in the Bordeaux hospital at 7 and 14 days. Several statistical models and machine learning algorithms were compared: linear regression and Poisson regression with elastic-net penalization, random forest, and Fréchet random forest (ie, a random forest derived method able to learn directly from time series). 23 Negative predicted values were forced to 0.

Modeling strategy
To train the model, the primary analysis used a matrix where each row corresponds to a day and each column to a feature from Gironde (ie, the Bordeaux department) open data and Bordeaux Hospital EHR data. Data from other departments were not added, to limit the number of features. Prediction performance was evaluated depending on the data used for the forecast, initially using only hospitalization and RT-PCR and progressively adding SARS-CoV-2 incidence in Gironde, weather, EHR, vaccine, and variant data. Because of the elastic-net penalization, 24 each day the model might select different features by shrinking the beta coefficients of unimportant features to zero.
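For reference, the elastic-net penalized linear regression is commonly written as the minimization problem below. The text does not give the exact formulation used, so this parameterization (the one popularized by the glmnet software) and the symbols are assumptions: $y_{t+h}$ is the number of hospitalized patients $h$ days after day $t$ ($h = 7$ or $14$), $x_t$ the feature vector available on day $t$, $\lambda \ge 0$ the penalty strength, and $\alpha \in [0, 1]$ the mixing weight between the lasso ($\alpha = 1$) and ridge ($\alpha = 0$) penalties.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n} \sum_{t=1}^{n} \left( y_{t+h} - \beta_0 - x_t^{\top}\beta \right)^2
+ \lambda \left( \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right)
```

The $\ell_1$ term is what sets some coefficients exactly to zero, so a feature is "selected" on a given day when its fitted coefficient is nonzero.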
An additional analysis considered the Bordeaux Hospital as an additional unit among the other French departments. The advantage was to leverage the information from all the departments; the drawback was the impossibility of including specific information only available for the Bordeaux Hospital. Of note, incorporating the hospital as a third level inside the Gironde department would be possible in theory, but it would lead to the same restrictions and would not be feasible in the absence of EHR data from other departments. Results from this additional analysis are available in Supplementary Table S2.
Several feature engineering transformations were performed. The mean, minimum, and maximum value over the last 7 days were computed for each feature, as well as the first derivatives over the last 3, 7, 10, and 14 days. Features were smoothed using a local polynomial regression with a span of either 0 (ie, no smoothing), 7, 14, or 21 days to take into account outliers and weekly variations.
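The rolling summaries and first derivatives can be sketched as follows. This is an illustration only: the function name and feature names are hypothetical, the first derivative over the last k days is approximated by the average daily change (an assumption, since the exact estimator is not given), and the local polynomial smoothing step is not reproduced.

```python
def engineer_features(values, window=7, lags=(3, 7, 10, 14)):
    """Build per-day features from one daily series (oldest value first).

    Returns a list with one dict of features per day.
    """
    rows = []
    for t in range(len(values)):
        # Summary statistics over the last `window` days, current day included.
        recent = values[max(0, t - window + 1): t + 1]
        row = {
            f"mean_{window}": sum(recent) / len(recent),
            f"min_{window}": min(recent),
            f"max_{window}": max(recent),
        }
        for k in lags:
            # Average daily change over the last k days (assumed derivative estimate);
            # None when fewer than k past days are available.
            row[f"slope_{k}"] = (values[t] - values[t - k]) / k if t >= k else None
        rows.append(row)
    return rows
```

During the first days of the series the window is shorter than 7 days, so the summaries are computed on whatever history exists.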

Model evaluation
The models were evaluated every day on the data available from December 1, 2020 to January 17, 2022 (ie, data from May 16, 2020 to December 1, 2020 were used for training only). The models were trained using all prior data available at a given date d, and the forecast of the number of hospitalizations at both 7 and 14 days after d was evaluated. Prediction performances were evaluated according to median absolute error (MAE) and median relative error (MRE). The median was chosen over the mean because it is less sensitive to extreme values, which are frequent when the observed outcome is low (ie, a small absolute error could result in a huge relative error). Bootstrapped 95% confidence intervals were provided with 500 samples (more details in the Supplementary Material). Graphical evaluation was also performed. Prediction intervals were estimated by bootstrap and also with an operational rule of more or less 10% and 20% of the predicted value at 7 and 14 days. For the latter, given $\widehat{hosp}$ the predicted number of hospitalizations, the prediction interval was $[0.9\,\widehat{hosp};\ 1.1\,\widehat{hosp}]$ at 7 days and $[0.8\,\widehat{hosp};\ 1.2\,\widehat{hosp}]$ at 14 days.
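The two error metrics and a bootstrap interval can be sketched as below. The bootstrap scheme actually used is described in the Supplementary Material and is not reproduced here; this sketch assumes simple resampling of days with replacement and percentile intervals, which ignores temporal autocorrelation. Function names are hypothetical.

```python
import random
from statistics import median

def median_errors(observed, predicted):
    """Median absolute error (MAE) and median relative error (MRE)."""
    abs_err = [abs(p - o) for o, p in zip(observed, predicted)]
    # Relative error is undefined when the observed count is 0;
    # skipping those days is an assumption of this sketch.
    rel_err = [abs(p - o) / o for o, p in zip(observed, predicted) if o > 0]
    return median(abs_err), median(rel_err)

def bootstrap_ci(observed, predicted, n_boot=500, seed=0):
    """Percentile 95% intervals for MAE and MRE from resampled days."""
    rng = random.Random(seed)
    pairs = list(zip(observed, predicted))
    maes, mres = [], []
    for _ in range(n_boot):
        # Resample days with replacement, keeping each (observed, predicted) pair.
        sample = [rng.choice(pairs) for _ in pairs]
        mae, mre = median_errors(*zip(*sample))
        maes.append(mae)
        mres.append(mre)
    maes.sort()
    mres.sort()
    lo, hi = int(0.025 * n_boot), int(0.975 * n_boot) - 1
    return (maes[lo], maes[hi]), (mres[lo], mres[hi])
```

The relative error makes the sensitivity to low counts visible: an absolute error of 5 patients is 10% when 50 patients are hospitalized but 25% when only 20 are.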

Description
The relationship between predictors and hospitalizations changes over time. For instance, as described in Supplementary Figure S1, RT-PCR results both in the Bordeaux Hospital and in Gironde anticipated hospitalizations well from June 2020 to December 2020. Unfortunately, the relationship became less consistent beyond December 2020. For instance, Gironde RT-PCR results are synchronous with the hospitalization peak in April 2021 but do not anticipate it. In addition, the large increase of positive RT-PCR at the end of summer 2021 is not associated with a similar increase of hospitalizations, which is probably due to vaccination. Those findings explain why the forecast of SARS-CoV-2 hospitalizations is difficult and why it is interesting to leverage different data sources to improve prediction capacity.
Figure 2C supports those findings and shows that using EHR data anticipates the April peak, better forecasts the 2021-2022 winter increase, and is more robust to the hospital cluster of December 2020. Last, we evaluated the information added by the number of vaccinated people and the variant distribution. Those additional data provided similar forecast performance, and the vaccine-RT-PCR interaction improved the forecast at 7 days but degraded it at 14 days. Similar results were obtained with the additional analysis considering the Bordeaux Hospital as an additional department; results are available in Supplementary Table S2. Overall, we considered the model including all the data but without the vaccine-RT-PCR interaction as the best model.
Information retrieval from the data warehouse was done with a 1-day delay, whereas the delay was 5 days for hospitalizations and RT-PCR at the department level. Therefore, forecasting using data from the Bordeaux Hospital data warehouse gives a 4-day advantage over the usual open data. To evaluate its consequences, we compared the 7-day forecast with the 11-day forecast and the 14-day forecast with the 18-day forecast.
As shown in Table 1, this resulted in decreases in MAE and MRE of 2.08% and 4.9%, respectively, at 7 days and of 2.6% and 3.7%, respectively, at 14 days.
Table 2 shows the performance according to feature engineering and statistical model. Model performance was improved by smoothing and by feature transformation (ie, mean, minimum, maximum, and first derivative), which reduced the absolute error by 1.67 and 0.83, respectively, at 7 days and by 2.47 and 3.24, respectively, at 14 days. Linear regression with elastic-net penalization outperformed random forest, Fréchet random forest, and Poisson regression with elastic-net penalization. Finally, the best model was the elastic-net model using hospitalizations, RT-PCR, weather, emergency units, ambulance service data, vaccine, and majority variant, with a smoothing span of 21 days and the mean, min, max, and first derivative feature transformations.
Figure 1 represents the feature importance of the model in terms of percentage of selection over the days, with asymptotic confidence intervals. At 7 days, the most important features were the ones related to hospitalization, as there is an intrinsic dynamic of hospitalizations, which is less likely to be influenced by external factors over such a short period. At 14 days, the most important features were related to RT-PCR in the 60+ years old group, weather data, emergency units, and hospitalizations.
Figure 2A shows the best model predictions from 1 to 14 days. Predictions were mostly accurate except: (1) in December 2020 during the nosocomial cluster at Bordeaux Hospital, which had a specific dynamic, (2) at the end of March 2021, where the April decrease is anticipated 2 weeks earlier, and (3) during the summer 2021 and the winter 2021-2022, where the forecast overestimates hospitalizations, partly because RT-PCR increased massively whereas hospitalizations increased moderately, as depicted in Supplementary Figure S1.
The latter might be a consequence of the vaccination campaign and the Omicron spread. Figure 2B shows the prediction intervals of the forecast using an ad hoc rule of respectively 20% and 40% prediction interval width at 7 and 14 days, which has a better coverage percentage than the bootstrapped prediction intervals available in the Supplementary Material. This figure shows that prediction intervals are mostly correct except during summer 2021 and winter 2021-2022. The former is explained by the low number of infected patients, which generates narrow prediction intervals, and by the introduction of vaccination, which biases the forecast. The latter is explained by the overestimation of the hospitalization rise.
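Two of the summaries reported in this section are proportions computed over forecast days: the share of days on which a feature is selected (Figure 1) and the share of days on which the observed value falls inside the prediction interval (Figure 2B). A minimal sketch follows, assuming a Wald (normal approximation) interval for the "asymptotic" confidence interval and the operational rule of plus or minus 10% at 7 days and 20% at 14 days; function names are hypothetical.

```python
from math import sqrt

def proportion_with_wald_ci(flags, z=1.96):
    """Proportion of True flags with a Wald 95% interval (assumed CI type)."""
    n = len(flags)
    p = sum(flags) / n
    half = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

def operational_interval(predicted, horizon):
    """Interval of +/-10% (7 days) or +/-20% (14 days) around the prediction."""
    margin = {7: 0.10, 14: 0.20}[horizon]
    return (1 - margin) * predicted, (1 + margin) * predicted

def coverage(observed, predicted, horizon):
    """Share of days whose observed value lies inside the operational interval."""
    flags = []
    for o, p in zip(observed, predicted):
        lo, hi = operational_interval(p, horizon)
        flags.append(lo <= o <= hi)
    return proportion_with_wald_ci(flags)
```

Because the interval width is proportional to the prediction, it becomes very narrow when few patients are hospitalized, which is consistent with the poorer coverage noted for summer 2021.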

DISCUSSION
This work demonstrates good overall forecast ability, in terms of both relative and absolute error of predictions, which were both improved by the addition of weather and data-warehouse information. In addition, the intrinsic "rules" governing the epidemic evolve over time, and external interventions such as vaccination make the hospitalization time series nonstationary, which is an important challenge for a data-driven approach. To mitigate this, we used an adaptive approach where the model is trained every day on historical data, which is realistic given the swiftness of penalized linear regression training. In addition, elastic-net penalization allows the model to select different features each day, enhancing the adaptiveness of the model.
In their review, Rahim et al 7 identify compartmental models and deep learning as the most common approaches to perform short-term forecast, followed by machine learning and statistical learning. Formal comparison of their performance is difficult because dates and locations vary across the reviewed articles. Table 3 summarizes findings of selected related work. Cramer et al 8 describe the COVID-19 Forecast Hub, an open platform where several academic, industry, and independent groups proposed forecasting models for US SARS-CoV-2 cases, hospitalizations, and deaths, which permits head-to-head comparison. In their paper, they evaluated death forecast performance. They show that performance varied from one algorithm to another, showing no trend as to which data would consistently provide the best performance. Interestingly, the best algorithm used an ensemble method, and the second and third best algorithms used compartmental methods.
In France, the forecast performance of previous studies is generally claimed to be good, as shown in Table 3. 9,10,12 However, benchmarking and comparisons are difficult due to different time periods and geographical scales. Graphical evaluations help to qualitatively compare model behavior, especially considering slope shifts, but they are not always provided 9,12 and can show difficulties in anticipating those shifts. 10 Of note, the work proposed by Mohimont et al 12 was able to anticipate the November hospitalization slope shift at the national level.
The main strength of our study was to leverage information from different sources. The addition of weather data improved forecasting, which is consistent with previous work. 25 The use of specific information from a hospital data warehouse improved the performance both by increasing the amount of information and by updating the information faster than public data sources.
In our work, the vaccine and variant data had little effect on the performance of the model. This was partly expected because vaccination is a monotonically increasing curve and the majority variant is relatively stable over time, as shown in Supplementary Figure S2. In addition, there might be a measurement bias for variants, as not every SARS-CoV-2 positive swab is tested for variant identification. Furthermore, the information added by variant and vaccine might already be captured by both the RT-PCR and hospitalization features. We also observed that RT-PCR-variant interaction deteriorated the model performance at 14 days. This was explained by a spurious positive correlation between the interaction term and the hospitalizations. Nevertheless, both vaccine and variant are linked to the SARS-CoV-2 epidemic, and the inability to leverage that information in a machine learning approach might call for a more mechanistic one. [25][26][27]
The moderate impact of the addition of Gironde data to the model might be explained by: (1) the information already being captured by the local data from the Bordeaux Hospital and (2) the data collection being slower at the department level than in the hospital. Data consolidation (ie, the update of prior data thanks to new information) was not taken into account because of the lack of data versioning. For instance, hospitalization related to SARS-CoV-2 is sometimes identified in the EHR through billing codes in the International Classification of Diseases, 10th revision, which are only available at the end of the hospital stay. This also might have biased the performance evaluation.
Linear regression with elastic-net penalization outperformed other more complex models such as Fréchet or classical random forest. This might be explained by: (1) a linear relationship between the features and (2) the difficulty for random forest to extrapolate feature relationships outside of the training set space, which a linear regression can handle more naturally.
This work outlined the advantage of adding information from the EHRs to improve the forecast, especially at 14 days and during November and December 2021, as shown in Figure 2C. This was expected because the EHRs provide additional information about the local evolution of the pandemic through the emergency unit and ambulance service data. Those data inform on the more specific population of COVID-19 symptomatic patients, who differ both from the overall COVID-19 patients detected by RT-PCR and from the severe COVID-19 patients who are hospitalized. Figure 3 shows that model performance is not homogeneous. Relative error was generally higher when the number of hospitalizations was low (summer 2021, November 2021), which is expected because the denominator is lower during those periods. Absolute error was higher during December 2020, which is expected because the model had not yet learned from many observations and there was a hospital cluster during this period. It was also higher during the summer 2021, which might be explained by the first effects of the vaccination campaign.
Although the model performance can be considered good, there are some limitations. First, performance tends to deteriorate when there is a sudden change in the hospitalization dynamic. For instance, in February and March 2021, the decrease and the increase of hospitalizations were not well anticipated by the model. Second, changes in the infectiousness of the virus, due either to a mutation or to vaccination, were not immediately learned by the model. This might explain the overestimation during summer 2021, and it might occur again in the future. Third, the model used for this task is a linear regression, and it might not capture complex relations. As discussed before, random forest did not improve the forecast, but other machine learning methods such as reservoir computing may. 28
The model is currently used in the Bordeaux University Hospital on a daily basis to anticipate the evolution of the number of COVID-19 hospitalizations. The forecast is used in conjunction with other indicators (number of hospitalizations, RT-PCR, emergency unit workload, etc.) and is discussed with both clinicians and public health experts. In our experience, the forecast is particularly informative to anticipate when a local peak is reached and hospitalizations will decrease, or when a new wave arrives.

CONCLUSION
This work highlights the advantage of leveraging several different data sources to improve forecast accuracy. The development of hospital data warehouse might help to get more specific and faster information than a traditional surveillance system, which in turn will help to improve epidemic forecasting at a larger and finer scale.

FUNDING
This work has been partly supported by Inria, Mission COVID19, GESTEPID project, and Nouvelle Aquitaine regional funding (Prediction territorial COVID N 1333140).

AUTHOR CONTRIBUTIONS
TF contributed to the data curation, formal analysis, software and writing-original draft. VJ contributed to conceptualization, data curation, supervision, funding acquisition, writing-original draft. RG contributed to data curation, software, writing-original draft. BH contributed to conceptualization, methodology, supervision, writing-original draft. RT contributed to conceptualization, methodology, supervision, funding acquisition, writing-original draft.

ETHICAL APPROVAL
No ethics committee approval was needed for this work as data used for modeling were aggregated.