Optimizing COVID-19 surveillance using historical electronic health records of influenza infections

Abstract Targeting surveillance resources toward individuals at high risk of early infection can accelerate the detection of emerging outbreaks. However, it is unclear which individuals are at high risk without detailed data on interpersonal and physical contacts. We propose a data-driven COVID-19 surveillance strategy using Electronic Health Record (EHR) data that identifies the most vulnerable individuals, namely those who acquired the earliest infections during historical influenza seasons. Our simulations on all three networks demonstrate that the EHR-based strategy performs as well as the most-connected strategy. Compared with random acquaintance surveillance, our EHR-based strategy detects the early warning signal and the epidemic peak much earlier. On average, the EHR-based strategy provides an early warning 9.8 days, and a peak 13.5 days, ahead of the whole population. For the urban network, the expected lead times of our method exceed those of the random acquaintance strategy by 24% for early warning and 14% for peak timing. For the scale-free network, the EHR-based method exceeds the random acquaintance strategy, on average, by 75% for early warning and 109% for peak timing. If the contact structure is persistent enough, it will be reflected in individuals' history of infection. Our proposed approach suggests that seasonal influenza infection records could be used to monitor new outbreaks of emerging epidemics, including COVID-19. The method exploits the effect of contact structure without considering it explicitly.


Introduction
A novel coronavirus (SARS-CoV-2) is thought to have emerged in the last quarter of 2019 in Wuhan, China (1), and was declared a pandemic by the World Health Organization (WHO) on 11 March 2020 (2). By 29 September 2021, 219 million cases of COVID-19 and 4.6 million deaths (3) had been reported worldwide. Infectious disease surveillance systems provide critical information on the occurrence of infections and allow early detection of COVID-19 outbreaks before they spread out of control. Surveillance of COVID-19 has relied mainly on reported cases, contact tracing, and projections (4,5), coupled with syndromic surveillance systems to track anomalous increases in COVID-like-illness (CLI) symptoms (5,6).
In the past decade, public health agencies have benefited from an influx of medical, epidemiological, and computational scientists, who have developed model-based capabilities for empirical analysis and mathematical modeling that capture the unfolding of epidemic outbreaks, their mechanisms, and their response strategies. However, the increasing frequency of unexpected emerging and reemerging infectious diseases demonstrates the need for improved capacity to accelerate outbreak detection in designing disease surveillance systems (7).
With more effective surveillance of more vulnerable individuals as potential infection reservoirs, we can more effectively uncover early signals of an emerging epidemic outbreak, allowing expedited and optimal deployment of resources for its control. Decades of epidemiology research have demonstrated the influence of the contact network structure on epidemic outbreaks by determining whether and when susceptible individuals are infected (7-12). Surveillance strategies that map out individual contact behaviors are largely based on static contact networks (13). Retrospective research has designed pioneering strategies based on topological structures. For example, ref. (14) presented a simple social-network-based strategy (random acquaintance) in a college population by monitoring friends of randomly selected students as the random acquaintance surveillance group (SG). The random acquaintance SG is expected to exhibit a signal 2 weeks earlier than the random surveillance strategy (in which the surveillance subset is selected randomly from the population). Moreover, ref. (7) further investigates different centrality-based surveillance strategies, showing how complete knowledge of the network of social interactions can be used to propose strategies that outperform the random and random acquaintance strategies.
Medical, epidemiological, and computational scientists have recognized the promise of network-based outbreak detection to improve epidemic preparedness and response. However, few of these strategies are applied in practical public health systems, because exploring the generally unknown contact network is challenging to implement and requires a large additional workforce and considerable economic cost (15). The deluge of available digital data in Electronic Health Records (EHR) in public health systems offers unprecedented opportunities to explore novel sentinel surveillance strategies. Such data have been used for contact tracing in South Korea in the context of the COVID-19 epidemic (16,17).
The aim of outbreak detection using sentinel surveillance is to detect a signal of the emerging outbreak as early as possible. This is similar to the well-studied problem of optimal vaccination on networks (18). Ref. (15) proposes an innovative vaccination method targeting previously infected individuals, as reported by individual infection history in EHRs. Previously infected individuals have a disproportionate probability of being highly connected within networks and of transmitting to others. This targeted strategy is validated in contact network epidemiology simulations and confirmed by empirical clinical data from Israel (15).
The current study introduces a practical data-driven surveillance strategy to accelerate outbreak detection using the simple logic of targeting the earliest infected individuals, identified by retrospective analysis of historical outbreaks. We assume that the most recent influenza-like outbreaks share a contact network closely similar to the one over which COVID-19 spreads in its early stages throughout the same region. Assuming that the past predicts the future in contact networks and that the past was affected by network structures, our method exploits the network structure without explicitly mapping out the contacts. In that sense, the method is network-free (even though the underlying processes are not).
Informed by historical influenza-like observations of individuals, we use mathematical epidemic models to systematically compare our proposed method with two well-studied surveillance strategies (random acquaintance and most connected) in the context of sentinel placement in networks where a COVID-19-like disease is spreading. We quantify the timing and accuracy of the information gained by these strategically chosen sensors, as well as the robustness of the node selection with respect to the number of previous seasons of information used and to outbreaks with different effective reproduction numbers, $R_e$.

Surveillance strategy using EHRs of historical influenza infections
We propose a new surveillance strategy that uses individuals estimated to be at high risk of early infection in a new outbreak. We assume that each individual who acquired influenza infection in a previous influenza season has a digital record in the EHR system, providing key epidemiologic information, including the potential infection time. Considering the effect of short-term cross-strain immunity after an influenza infection (23), we assume that each individual can be infected at most once in a single influenza season. We assume that the EHR data are available for multiple seasons.
Let $\eta$ be the number of influenza seasons with EHR data, and $R_e(i)$ the effective reproduction number of influenza infections in season $i = 1, 2, \ldots, \eta$. Let $\eta_j$ be the number of influenza seasons in which individual $j$ has EHR records of influenza infection, and let $\tau_{i,j}$ be the time at which individual $j$ acquires infection in influenza season $i$, according to the EHR records. With these definitions, we assess the expected risk of having an EHR record of influenza infection in any influenza season for individual $j$ through a score that essentially estimates individual $j$'s expected infection time over all influenza seasons. A node with higher eigenvector centrality (a measure of the influence of a node that operationalizes the recursive idea that central nodes are those with many central neighbors) has a smaller area under the curve of $\tau_j$ and $R_e$ (Fig. 1). As in ref. (7), we take as sentinel surveillance nodes the top 1% of individuals with the highest expected risks of influenza infection over all seasons. Let EHR-I$_\eta$ denote the EHR-based strategy using $\eta$ influenza seasons of EHR records. We test the surveillance performance with $\eta$ increasing from 1 to 10 seasons of EHR records. Our main analysis uses EHR-I$_5$, with five seasons of EHR records (Fig. 2), because further increasing the number of EHR seasons $\eta$ gives similar results (Fig. 3).
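The selection step can be illustrated with a minimal Python sketch. The paper's exact risk score is not reproduced here, so the code substitutes a hypothetical earliness score (the mean over seasons of 1 - tau / season_length, with unrecorded seasons contributing 0). The function name, the `season_length` parameter, and the data layout are likewise assumptions made for illustration.

```python
import math

def ehr_surveillance_nodes(infection_times, season_length, fraction=0.01):
    """Rank individuals by a hypothetical earliness score; return the top fraction.

    infection_times: dict mapping an individual to a list with one entry per
        season, holding the recorded infection day or None if no EHR record.
    season_length: number of days in a season, used to normalise times.
    """
    scores = {}
    for node, times in infection_times.items():
        # ASSUMED score, not the paper's formula: earlier infections and more
        # infected seasons both raise the score.
        earliness = [1.0 - t / season_length for t in times if t is not None]
        scores[node] = sum(earliness) / len(times) if times else 0.0
    n_select = max(1, math.ceil(fraction * len(scores)))
    # Highest score first; ties broken by label so the result is reproducible.
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return ranked[:n_select]
```

With `fraction=0.01` the function returns the top 1% of individuals, matching the surveillance-group size used in the text. Any score that decreases with infection time could be swapped in without changing the rest of the sketch.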

Conventional network-based surveillance strategy
We compare our EHR-based strategy with two conventional network-based surveillance strategies: (1) the most-connected strategy, which uses the top 1% of hub individuals with the highest numbers of network connections, and (2) the random acquaintance strategy, which first randomly selects 1% of individuals in the network and then uses one random acquaintance of each randomly selected individual as a surveillance node. As in ref. (7), we use 1% of individuals in the network as the surveillance nodes.
Fig. 2. Performance of the most-connected (red), random acquaintance (blue), and EHR-based (green) strategies. The EHR-based strategy here uses the EHR records obtained from five historical seasons as an example. In the upper panels, the horizontal and vertical axes present the early warning (days) and peak timing (days) measures for each strategy. In the bottom panels, the horizontal and vertical axes present the peak magnitude and situational awareness measures for each strategy. Panels from left to right correspond to the results using the urban, scale-free, and student networks, respectively. In each panel, dots and error bars indicate the mean and standard deviation across 100 simulations of each strategy.
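The two network-based selection rules described above can be sketched in a few lines of Python. The adjacency representation (a dict mapping each node to a set of neighbours) and the function names are assumptions for illustration. The sketch also lets the same acquaintance be picked more than once, which the text does not specify.

```python
import random

def most_connected(adj, fraction=0.01):
    """Top `fraction` of nodes by degree (ties broken by node label)."""
    n_select = max(1, int(fraction * len(adj)))
    return sorted(adj, key=lambda n: (-len(adj[n]), n))[:n_select]

def random_acquaintance(adj, fraction=0.01, rng=None):
    """Select random individuals, then one random neighbour of each."""
    rng = rng or random.Random()
    n_select = max(1, int(fraction * len(adj)))
    # Only individuals with at least one contact can name an acquaintance.
    candidates = [n for n in adj if adj[n]]
    chosen = rng.sample(candidates, n_select)
    # Sorting the neighbour set makes the draw reproducible for a given seed.
    return [rng.choice(sorted(adj[n])) for n in chosen]
```

On a star-shaped network the random acquaintance rule returns the hub whenever a leaf is drawn first. This is the mechanism by which the rule favours well-connected individuals without knowledge of the full network.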

Simulation settings
We simulate the spread of SARS-CoV-2 and seasonal influenza in contact networks, in which nodes denote individuals and edges denote physical contacts. We use three different networks: the urban network of public wireless network usage (19), a scale-free network built with the Barabási-Albert (BA) algorithm (20), and the student network of class attendance (7). The degree distribution has a power-law pattern in the scale-free and urban networks and a Poisson-like pattern in the student network. Details of these networks are specified in Methods. We use a susceptible-exposed-infectious-recovered (SEIR) epidemic model to describe the historical spread of seasonal influenza, and a susceptible-exposed-asymptomatic-symptomatic-recovered (SEAYR) model to describe the contemporary spread of SARS-CoV-2 (Methods). We use the stochastic chain-binomial approach to simulate the spread of these epidemics in contact networks.
Fig. 3. Average eigenvector centrality of the surveillance nodes identified by each strategy, including the random acquaintance strategy, the most-connected strategy, and EHR-based strategies with increasing years of EHR influenza records. Panels from left to right correspond to the results using the urban, scale-free, and student networks, respectively. The horizontal axis, the timeline, denotes the % of earliest infected nodes. Each strategy identifies a collection of surveillance nodes. For each EHR-I$_5$ strategy, we run 100 simulations. For example, a simulation of EHR-I$_5$ uses the first five sequential influenza-like simulations to select the surveillance nodes. We evaluate the average eigenvector centrality of 1% of nodes along the horizontal axis in steps of 0.01. The EHR-based strategy performs better as more historical outbreaks are included. The average eigenvector centrality of the EHR-based surveillance subset with a shorter history lies between that of the random acquaintance and most-connected strategies.

Four criteria to evaluate the performance of surveillance strategies
We consider the disease prevalence of SARS-CoV-2 as the proportion of infected individuals, including exposed, asymptomatic, and symptomatic individuals, at a given time. Based on refs. (7,21,22), we use the following four criteria to evaluate the performance of each surveillance strategy in monitoring the SARS-CoV-2 epidemic:
(1) Early warning. Let $t^{EP}_{\mu}$ be the time at which the disease prevalence in the entire population (EP) reaches a predefined threshold $\mu$, and $t^{SG}_{\mu}$ the time at which the disease prevalence in the SG reaches the same threshold $\mu$. We consider $\mu = 1\%$. The early warning criterion measures the time lag $t^{EP}_{\mu} - t^{SG}_{\mu}$.
(2) Peak timing. Let $t^{EP}_{peak}$ be the time at which the disease prevalence in the EP reaches its peak, and $t^{SG}_{peak}$ the time at which the disease prevalence in the SG reaches its peak. The peak timing criterion measures the time lag $t^{EP}_{peak} - t^{SG}_{peak}$.
(3) Peak magnitude. Let $r^{EP}_{peak}$ be the peak value of the disease prevalence in the EP, and $r^{SG}_{peak}$ the peak value of the disease prevalence in the SG. The peak magnitude criterion measures the ratio $r^{EP}_{peak} / r^{SG}_{peak}$.
(4) Situational awareness. The complement of the normalized mean absolute error (MAE) between the time series of disease prevalence in the SG and the EP, minimized over possible time lags $\lambda$ (7). Here, $x_t$ and $y_t$ denote the disease prevalence of the simulated SARS-CoV-2 epidemics in the EP and SG at time $t$, respectively.
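The first three criteria follow directly from their definitions and can be sketched in Python as below. The function names and the use of plain lists indexed by day are assumptions for illustration. Situational awareness is left out because its normalization is not spelled out in the text.

```python
def first_crossing(series, threshold):
    """Index of the first time step at which the series reaches the threshold."""
    for t, value in enumerate(series):
        if value >= threshold:
            return t
    return None  # threshold never reached

def surveillance_criteria(prev_ep, prev_sg, mu=0.01):
    """Early warning, peak timing, and peak magnitude for two prevalence series.

    prev_ep, prev_sg: prevalence per time step in the entire population (EP)
    and in the surveillance group (SG), on the same time grid.
    """
    t_ep = first_crossing(prev_ep, mu)
    t_sg = first_crossing(prev_sg, mu)
    early_warning = None if t_ep is None or t_sg is None else t_ep - t_sg
    peak_ep, peak_sg = max(prev_ep), max(prev_sg)
    # list.index returns the first time step at which the peak is attained.
    peak_timing = prev_ep.index(peak_ep) - prev_sg.index(peak_sg)
    peak_magnitude = peak_ep / peak_sg if peak_sg > 0 else None
    return early_warning, peak_timing, peak_magnitude
```

Positive early warning and peak timing values mean the surveillance group leads the entire population. A peak magnitude ratio below 1 means the surveillance group overestimates the population peak.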

Main findings
The EHR-based strategy outperforms the random acquaintance strategy on almost all of the four evaluation criteria. The performance of our EHR-based strategy is comparable to that of the most-connected strategy (Fig. 2). In heterogeneous networks, including the urban and scale-free networks, the EHR-based strategy and the most-connected strategy both perform well in the surveillance tasks of early warning and peak timing. In all tested networks, the peak magnitude predicted by our EHR-based strategy is much closer to that predicted by the most-connected strategy than to that predicted by the random acquaintance strategy.
Specifically, in the urban network (Fig. 2A and D), the EHR-based strategy triggers an early warning, on average, 9.8 days before the whole population reaches the predefined threshold, 24% earlier than the random acquaintance strategy (7.9 days on average). Its average peak timing, peak magnitude, and situational awareness are 13.5 days, 22.38, and 0.15, respectively. Compared with the random acquaintance strategy, the EHR-based strategy shows a 14% higher peak timing, a 3.2-fold overestimation of the peak magnitude, and a 44% decrease in situational awareness. In the scale-free network (Fig. 2B and E), the EHR-based strategy has an early warning, peak timing, peak magnitude, and situational awareness of 6.3 days, 8.9 days, 1.82, and 0.54, respectively. On average, our strategy shows a 75% improvement in early warning, a 109% overestimation in peak timing, a factor of 1.1 in peak magnitude, and a 75% decrease in situational awareness compared with the random acquaintance strategy. In the student network (Fig. 2C and F), the EHR-based strategy has an early warning, peak timing, peak magnitude, and situational awareness of 3.61 days, 2.68 days, 2.08, and 0.63, respectively. Compared with the random acquaintance strategy, it shows 7% less early warning, an 87% overestimation in peak timing, a factor of 1.15 in peak magnitude, and a 9% decrease in situational awareness.
To explain the performance of the surveillance strategies, we explore the subsets of individuals selected by these strategies in terms of their eigenvector centrality. Figure 3 suggests that the EHR-based strategy tends to select nodes with higher eigenvector centralities. Increasing the number of influenza seasons used in the EHR-based strategy facilitates the identification of central nodes in the network. Lastly, the EHR-based selection of nodes becomes more similar to the selection made by the most-connected strategy as we increase the number of historical outbreaks (Figure S4, Supplementary Material).

Discussion
From contact network epidemiology, we know that central nodes are at higher risk of being infected early in an epidemic and can thus be identified as being disproportionately represented among the previously infected in seasonal diseases (for instance, influenza, Chlamydia, and Lyme disease) (15). Building on the availability of EHR systems, we propose a novel surveillance strategy, denoted the EHR-based strategy: selection based on historical records of infection, which can be implemented in the context of sentinel placement for COVID-19 surveillance. The advantage of this approach is that if the contact structure (or risk behavior based on the connectivity of individuals) is persistent enough, then it will, on average, be reflected in the history of infection of each individual. Thus, this method exploits the effect of contact structure without knowledge of the network itself. Mapping the network is both difficult and itself reliant on the assumption that the structures are persistent.
Through epidemic simulations in static contact networks, we found that this novel strategy can accelerate the epidemic outbreak detection process, competing with the other static-network strategies in terms of practicality and early warning. To assess the centrality dynamics of nodes selected by the EHR-based strategy, we calculate and compare (computationally and theoretically) the eigenvector centrality of the nodes selected by each strategy on both empirical and synthetic networks. We find that our proposed surveillance strategy is competitive when compared with other strategies and depends on the number of historical outbreaks and the public health objective.
We studied the relationship between the selection of nodes using the EHR-based strategy and the optimal theoretical surveillance subsets (see Methods). Following percolation theory on networks (7,23) for a spreading SEIR infectious disease, we analytically calculate the optimal surveillance subset. We show that the nodes selected for the surveillance subset by the EHR-based strategy and the optimal theoretical selection (the nodes with the highest eigenvector centrality) tend to become similar as the number of historical records increases.
In the context of an actual new emerging or reemerging infectious disease, the EHR-based strategy can be applied using historical records of a different (related) disease. The ranking of individuals can be learned from knowledge of other infectious diseases in the same spatial setting, concurrently or sequentially. For example, the transmission dynamics learned from the surveillance of seasonal influenza can be used to estimate the outbreak risk of varicella in Taiwan (24).
Although we believe our qualitative results are robust and implementable, we need to address a few simplifying assumptions. First, our model does not account for reinfection with influenza within a single flu season. Temporary cross-strain immunity is estimated to be of short duration according to real-world data (e.g. 42 days in the US (25)). However, if the circulating strain remains the same during two consecutive influenza seasons, the immunity gained in the past season may protect previously infected individuals from reinfection with the same strain. To reduce this potential bias, we suggest excluding these years. Second, our proposed strategy identifies a small proportion of the population as surveillance nodes for early detection of new outbreaks. The identified surveillance individuals may not be representative of the EP, and hence may not be suitable for other surveillance purposes such as estimating the final attack rate or population prevalence. Third, the accessibility of EHR data could be limited by privacy-related restrictions, which could narrow the applications of the method. Fourth, the basic reproduction number may not be directly estimable in an influenza season. However, it could be approximated using the effective reproduction number, vaccine coverage, and vaccine efficacy. Fifth, it is possible that not all infected individuals will have their influenza records registered. We perform a sensitivity analysis by reducing P (health-seeking), the probability that an infected individual seeks treatment and has an influenza record in the EHR, from 75% to 25% (Figures S1 to S3, Supplementary Material). We find that the EHR-based strategy does not work well when P (health-seeking) is reduced to 25%.
We conclude that the proposed EHR-based strategy for sentinel surveillance selection is competitive with other existing surveillance strategies in networks. This strategy, in general, could prove useful to public health policy makers, by offering a practical and robust alternative without the knowledge of individual contact behaviors, especially when a long enough history of EHR in public health systems is available. In this study, we provide a new method for surveillance of populations, which can also be used synergistically with network-based strategies. Additionally, our EHR-based strategy could be extended to consider the case of targeted testing and targeted vaccination.

Modeling the historical spread of seasonal influenza in contact networks
We simulate epidemic outbreaks using a stochastic chain-binomial model in contact networks with nodes as individuals and edges as interpersonal physical contacts. The degree of a node is the number of other nodes connected to it via its edges.
For seasonal influenza, each individual is in one of four states: susceptible (S), exposed (E), infectious (I), or recovered (R). The transmission rate of the disease is $\beta$. Node $i$ remains exposed for $1/\sigma$ days and infectious for $1/\gamma$ days, after which it recovers. The basic reproduction number of a disease, denoted $R_0$, is the expected number of secondary infections caused by a single infection in an entirely susceptible population and is commonly used to indicate the epidemic growth rate. It is approximately equal to the effective reproduction number ($R_e$, the average number of secondary cases per infectious case in a population made up of both susceptible and nonsusceptible hosts), given that most individuals are susceptible in our simulations. We fix $1/\sigma$ and $1/\gamma$ for every simulation to 4 and 7 days, respectively, within the range of estimates for common respiratory diseases, including influenza (26). The disease prevalence is counted as the number of people in E over time. We let $R_0$ follow a Triangular(1.12, 1.25, 1.33) distribution, according to seasonal influenza epidemics across countries from 2000 to 2011 (27).
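A daily chain-binomial SEIR update on a contact network can be sketched as follows. This is a minimal illustration, not the authors' simulation code. It assumes that `beta` is a per-contact daily transmission probability and that `sigma` and `gamma` are daily transition probabilities (for example 1/4 and 1/7). The function name and the adjacency format are also assumptions.

```python
import random

def simulate_seir(adj, beta, sigma, gamma, seed_node, rng, max_days=365):
    """Daily chain-binomial SEIR on a contact network.

    adj: dict mapping each node to the set of its neighbours.
    Returns (infection_day, history): the day each ever-infected node became
    exposed, and the daily (S, E, I, R) counts.
    """
    state = {n: "S" for n in adj}
    state[seed_node] = "E"
    infection_day = {seed_node: 0}
    history = []
    for day in range(1, max_days + 1):
        new_state = dict(state)  # synchronous update from yesterday's states
        for node, s in state.items():
            if s == "S":
                k_inf = sum(1 for nb in adj[node] if state[nb] == "I")
                # Each infectious contact is escaped independently w.p. 1 - beta.
                if k_inf and rng.random() < 1.0 - (1.0 - beta) ** k_inf:
                    new_state[node] = "E"
                    infection_day[node] = day
            elif s == "E" and rng.random() < sigma:
                new_state[node] = "I"
            elif s == "I" and rng.random() < gamma:
                new_state[node] = "R"
        state = new_state
        counts = tuple(sum(1 for v in state.values() if v == c) for c in "SEIR")
        history.append(counts)
        if counts[1] == 0 and counts[2] == 0:
            break  # no exposed or infectious nodes remain
    return infection_day, history
```

The returned `infection_day` dictionary plays the role of the per-season infection times recorded in the EHR. A per-season $R_0$ could be drawn with `random.triangular(1.12, 1.33, 1.25)`, whose argument order is low, high, mode.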

Modeling the contemporary spread of SARS-CoV-2 in contact networks
To test the performance of the proposed strategy in COVID-19 scenarios, each individual is in one of five states: susceptible (S), exposed (E), asymptomatic (A), symptomatic (Y), or recovered (R). Node $i$ remains exposed for $1/\sigma_C$ days, after which it becomes infectious for $1/\gamma_C$ days, as asymptomatic with probability $1 - p_{sym}$ or symptomatic with probability $p_{sym}$, and then recovers. The infectiousness of asymptomatic individuals is likely to differ from that of individuals with symptoms, perhaps because they shed lower quantities of the infectious agent and have more potential contacts with others (28). Asymptomatic individuals have accordingly been treated as differing markedly in infectiousness from those with symptoms (29). The relative infectiousness of an asymptomatic individual (A) is $\omega$. The transmission rate of the disease is $\omega\beta_C$ for the asymptomatic state and $\beta_C$ for the symptomatic state. We fix $R_0$, $\sigma_C$, $\gamma_C$, $\omega$, and $p_{sym}$ for every simulation to 2.5 (1, 30), 1/5 per day (31), 1/2.5 per day (32), 0.5 (32), and 0.75 (33), respectively. In contact networks (with arbitrary degree distributions but random in every other respect), $R_0$ is related to the transmission rate through Eq. 1 (34), in which $\langle k \rangle$ and $\langle k^2 \rangle$ denote the mean and mean square of the degree. Following the evaluation of static-network strategies (7), given a specific $R_0$, we use Eq. 1 to solve for the corresponding transmission rate $\beta$ for influenza. For COVID-19 scenarios, we estimate $\beta_C$ as $\beta / (p_{sym} + (1 - \omega p_{sym}))$. We start simulations in the scale-free and student networks with one randomly sampled exposed seed, while simulations in the urban network start with 100 seeds. We investigate various epidemic outbreaks in networks to reflect the transmission variation of infectious disease (7).
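Solving for the transmission rate from a target $R_0$ can be sketched as below. The code assumes a commonly used relation for random networks with arbitrary degree distribution, $R_0 = T(\langle k^2 \rangle - \langle k \rangle)/\langle k \rangle$ with transmissibility $T = \beta/(\beta + \gamma)$. This may not be the exact form of Eq. 1, and the function name is hypothetical.

```python
def transmission_rate_from_r0(degrees, r0, gamma):
    """Solve beta from an ASSUMED relation between R0 and the degree moments.

    Assumed (possibly different from the paper's Eq. 1):
        R0 = T * (<k^2> - <k>) / <k>,   T = beta / (beta + gamma).
    degrees: iterable of node degrees; gamma: recovery rate per day.
    """
    degrees = list(degrees)
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(k * k for k in degrees) / n
    excess = (mean_k2 - mean_k) / mean_k  # mean excess degree
    transmissibility = r0 / excess
    if not 0.0 < transmissibility < 1.0:
        raise ValueError("R0 is not attainable on this degree sequence")
    # Invert T = beta / (beta + gamma) for beta.
    return gamma * transmissibility / (1.0 - transmissibility)
```

Under this assumed relation, heterogeneous networks with a large $\langle k^2 \rangle$ need a much smaller $\beta$ to reach the same $R_0$. This is why the transmission rate has to be solved separately for each network.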

Contact network datasets
In this study, we consider the spread of epidemics in a networked population, in which individuals in the population are connected through contact networks (7, 35-37). Following ref. (7), we use the following three networks, in which the interpersonal contacts are described by unweighted connections. We consider these three networks to explore the influence of their distinct topological properties.
(1) Urban network. This colocation network consists of 103,425 users (i.e. nodes) of the Île Sans Fil free public wireless network in Montreal, Canada. In this network, the connections represent concurrent hotspot usage (19).
(2) Scale-free network. This topologically heterogeneous network is generated using the seminal BA algorithm (20).
(3) Student network. This network consists of 4,634 students (i.e. nodes) of the Engineering Department of the Universidad de Los Andes in Mérida, Venezuela. In this network, the connections indicate that a group of students shared at least one class during the fall 2008 semester (7).
The degree distribution shows a power-law pattern in the scale-free and urban networks (19,20), and a Poisson-like pattern in the student network. Furthermore, the urban network has a strong community structure (19).

Centrality dynamics of our strategy
We use the analytical method developed in ref. (7) to study the relationship between the surveillance nodes selected in our EHR-based strategy and the analytically derived optimal set of surveillance nodes. We use an SEIR-like epidemic model to describe the spread of seasonal influenza in a network of size $N$, in which $\beta$ denotes the transmission rate per infectious contact and $\gamma$ denotes the recovery rate. According to the percolation theory developed in refs. (7,23), during the initial outbreak, the probability that each node acquires infection at time $t$ is approximated by an expression in $\kappa$, the leading eigenvalue of the adjacency matrix of the network, and $v$, the corresponding eigenvector. This approximation suggests that nodes with larger eigenvector components are more likely to have an earlier infection. Therefore, ref. (7) suggests that the optimal set of surveillance nodes needs to include the nodes with the highest eigenvector centralities. Let $M_{EHR-I}$ be the size of the set of surveillance nodes used in our EHR-based strategy, which are determined by historical EHR influenza infection records. During the initial outbreak, the average eigenvector centrality of the surveillance nodes in our EHR-based strategy is given by $c_{EHR-I} = v \cdot \mathbf{1}_{EHR-I} / M_{EHR-I}$, where $\mathbf{1}_{EHR-I}$ is an indicator vector whose elements are 1 if the corresponding nodes are chosen as surveillance nodes in the EHR-based strategy and 0 otherwise. Our EHR-based surveillance strategy can identify high-risk nodes with the largest eigenvector centralities, as indicated by the large average eigenvector centrality $c_{EHR-I}$ for nodes acquiring the earliest infections (Fig. 3).
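The average eigenvector centrality of a surveillance subset can be computed with a short power iteration, sketched below in Python. The function names, the adjacency format, and the normalization (components summing to 1) are assumptions for illustration, and the network is assumed to be connected.

```python
def eigenvector_centrality(adj, iterations=1000, tol=1e-10):
    """Leading eigenvector of the adjacency matrix by power iteration.

    adj: dict mapping each node to the set of its neighbours (connected graph).
    The returned components are non-negative and sum to 1.
    """
    nodes = list(adj)
    v = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Multiply by (A + I): the shift leaves the eigenvectors unchanged and
        # prevents oscillation on bipartite graphs.
        w = {n: v[n] + sum(v[nb] for nb in adj[n]) for n in nodes}
        total = sum(w.values())
        w = {n: x / total for n, x in w.items()}
        if max(abs(w[n] - v[n]) for n in nodes) < tol:
            return w
        v = w
    return v

def average_centrality(centrality, surveillance_nodes):
    """Mean centrality over the M surveillance nodes, i.e. (v . 1_SG) / M."""
    return sum(centrality[n] for n in surveillance_nodes) / len(surveillance_nodes)
```

Applying `average_centrality` to the node sets selected by different strategies gives the quantity compared across strategies in Fig. 3, up to the choice of normalization.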
Nodes are infected and selected as time advances. During the initial outbreak, $c_{EHR-I}$ increases with time as nodes with high eigenvector centralities are selected. After that, nodes with low eigenvector centrality are infected and selected (7). Thus, the EHR-based SG tends toward the optimal set by selecting the nodes infected earlier than other nodes, which tend to have higher eigenvector centrality.
Following ref. (7), let $\tau_{EHR-I}$ and $\tau$ be the times at which the EHR-based SG and another SG of size $M$ reach the same prevalence threshold $p$. Let $\mathbf{1}_{EHR-I}$ and $\mathbf{1}$ be the indicator vectors of dimension $N$ denoting the nodes selected by the EHR-based strategy and by the other strategy, respectively. The timing of early warning achieved by the EHR-based SG relative to the other SG, denoted $t_{EHR-I} = \tau - \tau_{EHR-I}$, depends on $c_{EHR-I} = v \cdot \mathbf{1}_{EHR-I} / M_{EHR-I}$ and $c = v \cdot \mathbf{1} / M$, the average eigenvector centralities in the two surveillance subsets. The early warning timing between the other SG and the EHR-based surveillance subset is determined by the ratio of their average eigenvector centralities. During the initial outbreak, nodes with high eigenvector centrality therefore become infected with higher probability. After this initial regime, in which most nodes with the highest eigenvector centralities have been infected, the infection spreads to nodes in the periphery of the network, i.e. nodes with low rankings of eigenvector centrality. Therefore, the average eigenvector centrality of all infected nodes decreases smoothly as time increases.
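As a worked sketch of why the ratio of average centralities sets the early-warning lead, assume the standard linearized early-growth form for an SIR-type process, in which infection probabilities grow along the leading eigenvector at rate $\beta\kappa - \gamma$. This growth form and the constant $a$ are assumptions made for illustration and may differ in detail from the expressions in ref. (7).

```latex
\begin{align*}
x(t) &\approx a\, e^{(\beta\kappa - \gamma)t}\, v
  && \text{(assumed early-growth form)} \\
p_{\mathrm{SG}}(t) &= \frac{\mathbf{1}_{\mathrm{SG}} \cdot x(t)}{M_{\mathrm{SG}}}
  \approx a\, c_{\mathrm{SG}}\, e^{(\beta\kappa - \gamma)t}
  && \text{(prevalence in a surveillance group)} \\
a\, c_{\mathrm{EHR\text{-}I}}\, e^{(\beta\kappa - \gamma)\tau_{\mathrm{EHR\text{-}I}}}
  &= a\, c\, e^{(\beta\kappa - \gamma)\tau} = p
  && \text{(both groups reach threshold } p\text{)} \\
t_{\mathrm{EHR\text{-}I}} = \tau - \tau_{\mathrm{EHR\text{-}I}}
  &= \frac{1}{\beta\kappa - \gamma}\,\ln\frac{c_{\mathrm{EHR\text{-}I}}}{c}
\end{align*}
```

Under this sketch, a surveillance group whose average eigenvector centrality is twice that of the comparison group gains $\ln 2 / (\beta\kappa - \gamma)$ time units of lead, and the lead vanishes when the two averages coincide.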
Considering an individual $j$ in season $\eta_j$, the probability $x(t_j, \eta_j)$ of being infected at time $t_j$ is proportional to a quantity denoted $\theta(t_j, \eta_j)$. Taking the ratio of $\theta(t_j, \eta_j)$ in two seasons ($i$ and $j$), we find that, in our proposed strategy, the historical vulnerability of an individual is a combination of $\tau_{i,j}$, the infection time of individual $j$ in season $i$, and $R_0(i)$ (or $R_e(i)$).