Optimal non-pharmaceutical intervention policy for Covid-19 epidemic via neuroevolution algorithm

Abstract Background National responses to the Covid-19 pandemic varied markedly across countries, from business-as-usual to complete shutdowns. Policies aimed at disrupting the viral transmission cycle and preventing the overwhelming of healthcare systems inevitably exact an economic toll. Methodology We developed an intervention policy model that comprised the relative human, implementation and healthcare costs of non-pharmaceutical epidemic interventions and identified the optimal strategy using a neuroevolution algorithm. The proposed model finds the minimum required reduction in transmission rates to maintain the burden on the healthcare system below the maximum capacity. Results We find that such a policy renders a sharp increase in the control strength during the early stages of the epidemic, followed by a steady increase in the subsequent ten weeks as the epidemic approaches its peak, and finally the control strength is gradually decreased as the population moves towards herd immunity. We have also shown how such a model can provide an efficient adaptive intervention policy at different stages of the epidemic without having access to the entire history of its progression in the population. Conclusions and implications This work emphasizes the importance of imposing intervention measures early and provides insights into adaptive intervention policies to minimize the economic impacts of the epidemic without putting an extra burden on the healthcare system. Lay Summary We developed an intervention policy model that comprised the relative human, implementation and healthcare costs of non-pharmaceutical epidemic interventions and identified the optimal strategy using a neuroevolution algorithm. Our work emphasizes the importance of imposing intervention measures early and provides insights into adaptive intervention policies to minimize the economic impacts of the epidemic without putting an extra burden on the healthcare system.


INTRODUCTION
March 2020, the World Health Organization (WHO) announced that Covid-19, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1], 'can be characterized as a pandemic' [2]. Within a month, most countries around the world had taken public health measures to contain the spread of the novel virus [3]. However, the type and severity of implemented measures and their subsequent success in minimizing the public health impacts of the outbreak varied greatly by country [4]. This variation in policies and their effectiveness reflects the complexity of finding the balance between two often competing policy objectives: protecting the public's health versus minimizing the economic impact of intervention measures [5].
Initially, without access to pharmaceuticals, studies focused on two distinct control approaches: mitigation and suppression [6][7][8]. The mitigation strategy aims to reduce transmission such that healthcare systems are not overwhelmed, while aiming to maintain the chain of transmission in order to achieve herd immunity. In contrast, the suppression strategy is aimed at virus elimination. In hindsight, countries that acted early to suppress the disease have excelled at minimizing both the public health and economic impact of the epidemic [9][10][11]. While early suppression measures appear to outperform the mitigation strategy both in terms of public health goals and economic costs, such policies would not necessarily be successful in countries where citizens are more averse to government-enforced control and surveillance measures [12]. Moreover, suppression measures would only be successful if implemented in the early stages of the epidemic and sufficiently strictly as to curtail transmission effectively. In a number of settings, however, suppression has been implemented in a piece-meal manner, leading to periods of drastic interventions including lockdowns punctuated by relaxation of social distancing measures and subsequent uptick in transmission [13,14]. This prompted us to examine the optimal mitigation strategy, which aims to manage or mitigate the healthcare impacts of the epidemic while population approaches herd immunity.
Characterizing immediate and long-term economic, social and human burden of Covid-19 epidemic is challenging and has led to several research efforts to examine the optimal intervention policy from various perspectives. It is unfeasible to review comprehensively this body of work, so we confine ourselves to a number of the key studies. Rowthorn and Maciejowski [15] investigated the optimal uniform lockdown in a Susceptible-Infectious-Recovered (SIR) model assuming a variety of parameterizations [15]. Their objective function assigned monetary values to costs arising from infection, lockdown, and value of life. Their main finding was that in the medium term, a policy that maintains effective reproduction number value close to 1 provides the best path. Bethune and Korinek [16] contrasted the decisions made by rational, individual agents with the choices made by a social planner who is able to coordinate the choices of individuals [16]. They found that rational agents generate large externalities because they fail to internalize the effects of their economic and social activities on others' risk of infection. Alvarez et al. formalized the social planner's dynamic control using an SIR epidemiological model and a linear economy [17]. The best strategy starts with a severe lockdown two weeks after the epidemic, covers 60% of the population after a month, and progressively decreases to 20% of the population after three months. More recently, a number of studies have broadened this exploration to identify age-specific optimal control strategies [18,19].
Inspired by Salimans et al., Such et al. and Riolo and Rohani [20][21][22], we sought to use a neuroevolution strategy to finding the optimal policy function which would dynamically determine the minimal required reduction in transmission rates at each time instant, deemed as 'control strength' hereafter. Reductions in transmission may result from lower contacts (due to isolationin-place ordinances, movement restrictions or lockdown policies), or the adoption of personal protective measures that serve to curtail transmission upon contact (such as the use of face masks and personal protective equipment, PPE), with varying societal impact. The fitness function is expressed such that a strategy is rewarded for allowing the epidemic to remove individuals from the susceptible pool without overwhelming the healthcare capacity. The proposed neuroevolution strategy begins by initializing a population of random policy functions. The generated policy functions are then used to simulate the trajectory of the epidemic. The fitness of each function is then evaluated based on the specified reward function. The most elite policy functions are then perturbed (mutated) to generate the next generation (offspring). The new population is then evaluated and this process is repeated for a pre-defined number of iterations. We also derived the optimal control solution via Pontryagin's maximum principle (PMP) [23] and compared the results with the optimal neuroevolution policy.
We have chosen the United Kingdom as our target population to implement the proposed approach. The choice of the UK as our target population was largely motivated by the frequent changes in the government's strategy to contain the epidemic [24], as summarized in Fig. 1. The UK's initial response was a mitigation policy, majorly inspired by the response to the flu pandemic, with an emphasis on protecting the most vulnerable to avoid overburdening the healthcare system in an effort to achieve herd immunity [9]. This initial policy later changed to a suppression policy by implementing lockdowns and imposing face mask-wearing requirements. Looking back at the early days of the epidemic, this study aims to understand how an effective mitigation policy could have been implemented (see Ref. [9] for a comparison of initial responses to Covid-19 by different countries including United Kingdom).
Our study explores mechanisms for 'flattening the curve'-it is motivated by the Covid-19 pandemic but need not be restricted to precise courses of action undertaken in the response to this pandemic event. Our findings are intended to be informative for future epidemic control, particularly at the early stages of an epidemic when there may be no effective pharmaceuticals in sight.
We find that the ideal intervention policy results in a rapid increase in control strength early in the epidemic, followed by a sustained increase over the next 10 weeks as the epidemic reaches its peak, and ultimately a progressive drop in control strength as the population achieves herd immunity. We have also shown how, without having access to the complete history of the epidemic's growth in the population, such a model may give an effective adaptive intervention policy at various stages of the epidemic. This study highlights the significance of implementing control measures as promptly as possible and offers insights into adaptive intervention strategies aimed at reducing the economic effect of epidemics while avoiding undue strain on the healthcare system.

Model structure
We used a deterministic, time-varying Susceptible-Exposed-Infectious-Recovered-Hospitalized in ICU (SEIRH) model [25] to characterize the transmission dynamics in the UK as described in equations (1)-(5): where b is the transmission rate, 1=q and 1=c give the mean latent and infectious periods, respectively and cðtÞ 2 ½0; 1 is the reduction in transmission (such that c(t) ¼ 1 signifies complete cessation of transmission). The state variable H(t) denotes the number of occupied ICU beds and is determined by the probability that an infection is detected (P Detection ), the fraction of cases that require ICU treatment (r ICU ) and the rate of admission to the ICU (c ICU Delay ). The mean duration of stay in the ICU is determined by 1=c ICU Stay . Model parameters and chosen values are presented in Table 1.
In our analyses, we examine changes in optimal intervention policy assuming policies are implemented starting at different points during the epidemic, T 0 . To identify the appropriate initial conditions at these different starting points, we used a particle filter [35] to estimate the effective retrospective daily c(t)

The reward function
As discussed by Moore et al. [36], there is precedent for integrating modeling methodologies and health-economic analyses to inform public health intervention decisions based on a willingness to pay for each Quality-Adjusted Life Year (QALY) saved [22,37,38]. Such an approach allows for allocating explicit monetary values to each term in the reward function [22]. While some cost-benefit analysis via this approach has been carried out in relation to Covid-19 [36], the pandemic's enormous scope renders traditional economic measurements largely impractical. As a result, a health-economic approach is not the main emphasis of this study. Instead, in order to capture the general societal impacts of pandemic mitigation efforts, we have employed a simple 'relative' economic cost to formulate the reward function.
We first introduce the following multi-objective reward function to account for three opposing goals: (i) Sustain viral transmission to achieve herd immunity, (ii) Keep the ICU occupancy below the maximum capacity and (iii) Impose the minimum possible control: r 1 ðtÞ ¼ a 1 r 1 ðtÞ Herd Immunity À a 2 r 1 ðtÞ Exceedance À a 3 cðtÞ 2 ¼ a 1 EðtÞ=N À a2ðHðtÞ À H max Þ=H max À a 3 Ã cðtÞ 2 : We defined r 1 ðtÞ for the sake of mathematical simplicity in deriving PMP solution and it is only used to compare the optimal non-pharmaceutical intervention (NPI) policies obtained from neuroevolution and PMP methods. For the remainder of this study, we use a slightly different objective function, r 2 ðtÞ, defined as follows: rðtÞ ¼ a 1 r Herd Immunity ðtÞ þ a 2 r Exc ðtÞ þ a 3 r Control ðtÞ; ¼ a 1 ðRðtÞ=NÞ À a 2 ReluððHðtÞ À H max Þ=NÞ À a 3 Ã cðtÞ: In both reward functions (equations (6) and (7)), the terms a 1 , a 2 and a 3 modulate the relative importance of herd immunity, healthcare burden and societal costs, respectively. The goal, therefore, is to identify the optimal intervention function c(t) that maximizes the sum of rewards, J, during the course of the epidemic: Pontryagin's maximum principle (PMP) In this section, we first derive the necessary conditions for optimal control via Pontryagin's maximum principle, and describe the iterative numerical algorithm (the forward-backward sweep method) used to find the optimal solution. First, we form the following Hamiltonian function: Here, k s ðtÞ are adjoint functions satisfying the adjoin system: _ k s ðtÞ ¼ À @Hðt; s Ã ðtÞ; c Ã ðtÞ; k Ã s ðtÞÞ @s ; s 2 S; E; I; R; H f g ; (10) Expanding equation (10) yields: The necessary conditions for the optimal control is obtained by maximizing the Hamiltonian (equation (9)) with respect to c(t): The state equations (equations (1)- (5)) and adjoint equations (equations (10)-(16)) together with state initial conditions and transversality conditions (equation (11)) form the 'Optimality system'. The explicit solution cannot be analytically derived. Thus we turned to an iterative numerical method, 'Forward-backward Sweep', to solve the 'Optimality system'.

Neuroevolution algorithm
The optimal policy function, p h , is a feed-forward neural network, parameterized by h that takes the state of the system at current time t, fSðtÞ; EðtÞ; IðtÞ; RðtÞg as input and returns the control strength, c(t). The neuroevolution strategy aims to find the optimal policy function, P G Mostelite , with highest fitness score. Fitness score of policy function j in generation i, f i j , is equal to the sum of rewards, J (equation (8)) and is obtained by running the SEIRH model with the corresponding policy function. First, M policy functions (P 1 j ) are randomly initialized. For each policy function, a trajectory is rolled out and fitness score is calculated at the end of simulation, as shown in Fig. 2. The L policy functions with the highest fitness scores are mutated to generate the next generation of policy functions. Mutation is implemented by adding a random Gaussian noise, scaled by the mutation rate, r, to h parameters of elite policy functions. The new offspring policy functions served as the parents of next generation. This process continues to find a policy function with a sufficiently high fitness score, P G Mostelite . We used a fullyconnected feed-forward network with three 16-unit hidden layers and one tanh output layer to model the policy function. Pseudocode for the neuroevolution algorithm used in this study is provided in Algorithm 1.

Which optimization algorithm?
We compared the optimal intervention policies obtained from PMP and neuroevolution policies (Supplementary Fig. S1). The policies are obtained using the r 1 reward function (equation (6)) with a 2 ¼ 1e À 1; a3 ¼ 5e À 3 and same initial conditions. We found the optimal policies obtained from both methods to be very similar. In simpler problems where an analytic solution can be obtained for the optimality system, the PMP method can provide more insights about the optimal control solution and the dynamics of the system. Otherwise, a neuroevolutionary approach is computationally advantageous since the resulting policy function provides an optimal strategy for a broad range of initial conditions at a substantially smaller computations cost. That is, the PMP optimal intervention for a given initial condition is obtained by solving the boundary-value problem formulated in equations (1)-(5) and equations (10)- (16). For a new boundary condition, the numerical solution must be repeated  to solve the new boundary-value problem. In the remainder of the paper, our optimal solutions are obtained via the neuroevolutionary approach.

Reward function exploration
The relative economic burden of different objectives in the reward function is determined by the weights, fa 1 ; a 2 ; a 3 g. Thus, we examined the effects of variation in these parameters on the resulting optimal policy (see Supplementary Fig. S3). We constrained a 1 to be 1 and changed the values of a 2 and a 3 over a logarithmic grid. For each parameter set, we trained the neuroevolution algorithm for 2000 generations with a population size of 256. The resulting policy functions (purple lines) and corresponding ICU occupancy trajectories of the 10 best-performing agents for each parameter set are depicted in Supplementary  Fig. S3. We found the reward function to be consistently robust to variation in the values of a 2 . That is, the tested range of a 2 values makes the cost of ICU overflow sufficiently prohibitive, leading to high-fitness strategies ensuring ICU maximum capacity is not exceeded (note that the ICU overflow reward is equal to 0 while the ICU occupancy is below the maximum capacity and negative otherwise). Evidently, making a 2 smaller would eventually deprioritize the goal of maintaining the ICU occupancy below the limit. Without loss of generality, we will use a 2 ¼ 1e9 in the remainder of this paper. In contrast, we found the reward function to be highly sensitive to variation in a 3 . For a 3 > 10 À4 , the relative cost (negative reward) of imposing control becomes prohibitive and leads to one of the extreme intervention strategies: Suppression policy to end the endogenous transmission at the earliest possible time and avoid imposing lengthy control measures; or a no-intervention policy that plainly leads to the minimum relative control cost. In practice, the inclination for a specific intervention strategy depends on the policy maker's priorities. We observed pronounced variation in the optimal policies and resulting ICU occupancy trajectories for smaller values of a 3 (compare the first and third columns, Supplementary Fig. S3). In Supplementary Fig. S4, we demonstrate this variation for each parameter set and across the values of a 3 . As shown in Supplementary Fig. S4A, values of a 3 smaller than 10 À4 result in greater 'Cumulative herd immunity reward'. Thus, when the relative cost of control is modest, the optimal policy function will tend to maximize the reward by increasing the number of individuals removed from the susceptible pool, which in turn leads to greater 'Cumulative control reward' (Supplementary Fig. S4B) and longer epidemic duration ( Supplementary Fig. S4C). Therefore, among the tested values, a 3 ¼ 1e À 4 represents the middle ground between prolonged intervention and suppression policies, and is the value that we have used in the rest of this paper.
No-intervention policy, uniform intervention policy and optimal policy Figure 3 presents a comparison between the optimal intervention policy identified via our neuroevolution algorithm, a uniform intervention policy and no-intervention policy. The uniform intervention policy is implemented by imposing a constant reduction in transmission throughout the epidemic, cðtÞ ¼ c u . The value of control strength, c u , is estimated such that the peak ICU occupancy tangents the maximum capacity. Figure 3A depicts the ICU occupancy trajectories of these three policies. As expected, the no-intervention policy leads to ICU burdens well beyond the threshold capacity for more than two months (67 days). The other notable observation is the difference between the optimal and uniform policies in managing the ICU burden: the optimal policy maintains the ICU occupancy near the maximum capacity throughout the epidemic, but not beyond it. Figure 3B depicts the implemented control strength in time for optimal and uniform policies. Except for a period of time less than 10 weeks at the onset of the epidemic, the control strength of the optimal policy is below the uniform intervention policy. The difference in the imposed control between two policies is better illustrated by Fig. 3C, where a widening gap between the cumulative imposed control of the two policies emerges after day 200. In Fig. 3D, we present the recovered individuals for each policy. Unlike the optimal policy, the final fraction of recovered individuals in the uniform intervention policy case is well below the theoretical herd immunity threshold. This suggests that any reduction in the control strength could lead to another epidemic wave given the large fraction of susceptible individuals.

The sooner the better
We have estimated the optimal intervention policy initiated at different stages of the epidemic, as shown in Fig. 4. Each scenario corresponds to a particular start date for the roll out of the optimal intervention policy. Figure 4A depicts the scenario in which optimal intervention policy starts on 1 March, which coincides with a surge in cases in the UK. The optimal intervention policy starts with cðtÞ ¼ 0:33 (a 33% reduction in transmission rates) and is gradually increased to cðtÞ ¼ 0:54 by mid-May. The control strength tapers off to 0 by June 2021. This scenario leads to two peaks in ICU occupancy, in November 2020 and June 2021. Figure 4B-E depict the optimal intervention policy starting at intermediate stages of the epidemic. As mentioned above, we estimated the initial conditions for each scenario by fitting our SIER model to fatality data using particle filtering, a Monte Carlo likelihood estimation algorithm for hidden statespace dynamical systems [39]. Comparing the optimal intervention policy curves in different scenarios depicts how implementing transmission reduction measures at earlier stages of the epidemic will eventually shorten the epidemic: The termination of optimal intervention policy is delayed from June 2021 (in Fig. 4A) to February 2022 (in Fig. 4D). The only exception is Fig. 4E, in which the optimal intervention policy terminates slightly sooner than in Fig. 4D. This is most likely due to the emergence of new variants with higher transmissibility [40], which gave rise to a faster depletion of the susceptible pool than accounted for in our model.
To better illustrate the importance of implementing early control measures, we have demonstrated the 'Total duration of intervention policy implementation' and 'Cumulative imposed control' for different scenarios in Fig. 5. The 'Total duration of intervention policy implementation' represents the time period between 1 March 2020 and the termination date of intervention policy for each scenario. The 'Cumulative imposed control' is obtained by summing the daily implemented control strength (c(t)), divided by total number of days with c(t) > 0 for each scenario. As shown in Fig. 5A, the 'Total duration of intervention policy implementation' increases from 442 days in the first columns to 700 days in the last one. Figure 5B also confirms the fact that implementing the optimal intervention policy from earlier stages of epidemic would reduce the overall required control measures. Note that depicted 'Cumulative imposed control' values do not include the actual imposed control strength (c(t)) before the start of optimal intervention policy and adding those values would only widen their differences. Also, the 'Cumulative imposed control' is a linear measure of overall imposed control; however, the actual economic cost would not necessarily change linearly with duration and strength of imposed intervention policy.  For each scenario, the number of susceptible, exposed, infectious and recovered individuals is estimated from a SEIRH model fitted to the UK fatality data and used as initial condition to derive the optimal intervention policy Finding the balance Figure 6 paints an overall picture of how the optimal policy fine tunes the transmission rates to sustain endogenous transmission in the population without overburdening the ICU capacity. Figure 6A demonstrates the variation of effective reproductive ratio (R eff ) throughout the epidemic (black line), the control strength is also shown (blue dashed line). At the onset of the epidemic, R eff is instantly reduced to 1.52 from 2.3 by imposing a 0.33 reduction in contact rates cðtÞ ¼ 0:33 and further decreased to R eff % 1 by mid-May (point i) to stall the epidemic growth. From point i to point ii, The R eff is maintained close to 1 to maintain the ICU occupancy close to the maximum capacity. At this point, c(t) is slightly increased that leads to a sharp decrease of R eff to 0.89 in point iii. This is followed by a steep decrease in c(t) to bring the R eff above 1 to sustain the transmission. To summarize, the optimal mitigation policy is achieved by finding the balance between two extreme scenarios: Suppression policy which aims to stall the endogenous transmission in the population, and 'No-intervention' which leads to exponential epidemic growth and the overburdening of healthcare capacity.

DISCUSSION
More than 18 months into the SARS-CoV-2 pandemic, it is becoming increasingly clear that countries that implemented suppression strategies early on experienced greater success in managing both the public health and economic burden of the epidemic [9][10][11]. However, such strategies work best when employed early in the epidemic, when number of cases is relatively small. Moreover, in countries where government-imposed restrictions are not well received by the public, implementation of such policies will be challenging. Looking back at the early stages of the epidemic, our work provides a dynamic mitigation strategy that sustains the community transmission without overwhelming the healthcare capacity. March 2020 and termination date of intervention policy for each scenario. The Cumulative imposed control is obtained by adding up the implemented control strength (c(t)) in each day, divided by total number of days with c(t) > 0 for each scenario A number of previous studies on optimal nonpharmaceutical interventions have used quadratic cost expressions for the control term in the cost function [19,41,42]. This is mainly because when the cost function is quadratic with respect to the control, the differential equations arising from the necessary conditions for an optimal control have a known solution. Other functional forms frequently provide difficult-to-solve systems of differential equations. To circumvent this, we employed a neuroevolution algorithm which enabled us also to explore non-quadratic functions. The neuroevolution algorithm was used to train a policy function that takes the epidemiological state of population (the numbers of susceptible, exposed, infectious and recovered individuals) on each time day and provides the corresponding control strength. We defined a multi-objective reward function to account for three conflicting goals: Sustain the transmission to achieve herd immunity when suppression is not feasible, maintaining the ICU occupancy below the maximum capacity and imposing minimum possible control measures to reduce the contact rates. A relative weighting parameter was assigned to corresponding terms of each of these objectives in the reward function. The sensitivity analysis indicated that the resulting policy function is highly sensitive relative weighting of the control term and found a optimal range of values for it. We chose United Kingdom as our target population and fitted an SEIRH model to fatality data to estimate the initial conditions at different stages of the epidemic.
The optimal intervention policy confirmed the importance of early interventions to reduce the contact rates in the population, as highlighted in the previous studies [15,42]. An initial 34% reduction in transmission at the onset of the epidemic, gradually increasing to 50% in the next 10 weeks is required to bring the R eff near 1. After that, the restrictions are constantly decreased as the size of susceptible pool diminishes. The association between the control strength and the size of the susceptible pool (except the first initial 10 weeks) highlights the importance of reliable and widespread serosurveys in order to inform policy decision making.
Our study highlights the neuroevolution algorithm, a gradient-free approach, as an efficient alternative to traditional PMP method for finding the optimal non-pharmaceutical intervention policy in dynamical disease transmission system. Past studies have demonstrated that in many challenging Figure 6. The optimal intervention policy maintains the effective reproductive ratio (R eff ) close to 1: The figure displays the changes in effective reproductive ratio when implementing the optimal intervention policy. The control strength (c(t)) is sharply increased at early stages of epidemic to stall the epidemic growth and keep healthcare capacity from being overwhelmed. The R eff is maintained close to 1 by gradually reducing the c(t) as the size of susceptible pool shrinks. Once the value of R eff reaches below 0.9, c(t) is increased to sustain the transmission in the population, while keeping the occupied ICU beds below the maximum capacity reinforcement learning tasks, neuroevolution algorithm rivals (or even outperforms in some domains) state-of-the art gradient-based methods such as Q-learning and A3C [21]. Interestingly, the forward-backward sweep technique that we used to obtain the optimal solution via PMP closely resembles the backpropagation, the algorithm used to train the gradientbased reinforcement learning methods [43]. Ultimately, we found the neuroevolution algorithm to be computationally advantageous to the PMP method as the former algorithm provides the optimal intervention policy for a broad range of initial values after initial training (as shown in Fig. 4), while the numerical solution to obtain the optimal control via PMP must be repeated for a new initial condition.
A key component of our neuroevolution algorithm is the assumption that the full epidemiological state of the population is observable at each time step. In reality, however, the observable data provide an incomplete and potentially biased picture of epidemiology since they are based on reported incidence, hospitalization and fatality data in addition to seroprevalence surveys. Besides assuming complete epidemiological information, our approach also assumed that the optimal intervention policy is implemented in deterministically; that is, the output action is perfectly implemented at each time instant and the resulting new state given the corresponding action is always the samesomething that is not practical. An important next step in this area would be to extend our novel framework to identify the optimal intervention strategies with hidden states in a stochastic setting. Furthermore, while this study addresses the optimal reduction in the contact rates over time, the economic cost and effectiveness of various non-pharmaceutical intervention mechanisms [44,45] to achieve the optimal policy reduction requirements must also be examined.