Network-augmented compartmental models to track asymptomatic disease spread

Abstract Summary A major challenge in understanding the spread of certain newly emerging viruses is the presence of asymptomatic cases. Their prevalence is hard to measure in the absence of testing tools, and yet the information is critical for tracking disease spread and shaping public health policies. Here, we introduce a framework that combines classic compartmental models with travel networks and we use it to estimate asymptomatic rates. Our platform, traSIR (“tracer”), is an augmented susceptible-infectious-recovered (SIR) model that incorporates multiple locations and the flow of people between them; it has a compartment model for each location and estimates of commuting traffic between compartments. TraSIR models both asymptomatic and symptomatic infections, as well as the dampening effect symptomatic infections have on traffic between locations. We derive analytical formulae to express the asymptomatic rate as a function of other key model parameters. Next, we use simulations to show that empirical data fitting yields excellent agreement with actual asymptomatic rates using only information about the number of symptomatic infections over time and compartments. Finally, we apply our model to COVID-19 data consisting of reported daily infections in the New York metropolitan area and estimate asymptomatic rates of COVID-19 to be ∼34%, which is within the 30–40% interval derived from widespread testing. Overall, our work demonstrates that traSIR is a powerful approach to express viral propagation dynamics over geographical networks and estimate key parameters relevant to virus transmission. Availability and implementation No public repository.


Introduction
At the outset of the COVID-19 pandemic, the prevalence of asymptomatic cases among infections was estimated to lie anywhere between 17% and 81% (Nogrady, 2020). Given the importance of this parameter for early health policy decisions (Nishiura et al., 2020), such a high level of uncertainty was a major roadblock. With testing now widely available, this issue has largely dissipated, with estimates of asymptomatic rates between $30% and $40% Shang et al., 2022). To prevent such difficulties in future epidemics, it would be highly beneficial to have computational tools for estimating the asymptomatic rate of infected individuals right at the beginning of an epidemic.
Infectious disease spread is classically modeled using compartmental models. The population is assigned to distinct compartments (e.g. the susceptible, infectious and recovered compartments in the widely studied SIR model) (Kermack and McKendrick, 1927), with rates at which individuals move from one compartment to another. When applying these compartment-based epidemiological models, it is impossible to predict the true prevalence of a virus early on in a pandemic without widespread random testing: indeed, even a tiny fraction of a population showing symptoms for the disease is compatible with a widespread infection. To estimate via computational modeling the fraction of infectious individuals that are asymptomatic, or the asymptomatic rate q, requires additional information. Here, we show that considering information about how a virus spreads in a spatial manner-not just between compartments at a single location-can be leveraged to estimate q. The intuition is that, while individuals travel between locations and this contributes to viral spread, individuals who feel sick (i.e. are symptomatic) tend to curb travel, which in turn yields a distinguishing observable between symptomatic and asymptomatic carriers.
In this article, we introduce traSIR (pronounced "tracer"), a network traffic-based SIR model, which combines the classic SIR compartmental model with network modeling. In traSIR, we have a network where each node is a location (e.g. a county or ZIP Code), each location is associated with a compartmental model and edges in the network represent frequent travel between the locations (e.g. commuting). TraSIR additionally models asymptomatic and symptomatic infections, together with a dampening effect on viral spread for symptomatic infections. Our primary contribution is to demonstrate the utility of traSIR in estimating the asymptomatic rate of an infectious disease using only knowledge about symptomatic infections across geographic locations, as well as information about typical travel between locations.
We begin with theoretical results relating the asymptomatic rate of infection to other key parameters of the model (e.g. infection and recovery rates). Since these key parameters are not known a priori and must be estimated from the data, we next assess how well parameters of a traSIR model can be estimated using only knowledge about symptomatic infections. In particular, we simulate disease spread using traSIR, and then perform empirical parameter estimation using the number of symptomatic infections over time across locations to estimate the asymptomatic rate q. Across a wide range of parameters, we find excellent agreement between the actual and estimated q values. Finally, we analyze the number of reported COVID-19 infections across the New York metropolitan area during the first wave, from March 1, 2020 to September 17, 2020.
The method behind traSIR seeks to combine topological flow information with diagnostic data and behavioral variations. It makes use of a number of observable nonlinearities: (i) in the absence of public health measures, a multiplicative decrease in the symptomatic rate causes a forward time-shift in the infection curve relative to its measurable baseline; (ii) detection of carriers grow superlinearly in the number of symptomatic cases; (iii) the number of newly symptomatic cases is largely determined by the asymptomatic neighbors in the network and (iv) asymptomatic carriers have a different transmissibility rate (Li et al., 2020). Our platform, traSIR, is the first of its kind to integrate county-level data with a commuter network on a large scale to recover critical epidemiological characteristics directly from network dynamics, in particular the asymptomatic rate.

Further background
Standard epidemiological models have previously been extended to account for disease spread across space, but the medium has typically been assumed to be homogeneous (where the population is treated as one large group, as opposed to interacting subpopulations) (Brauer et al., 2008;Eletreby et al., 2020), leading to a diffusive process. Typically, the speed of a wave across the population grows in proportion to the square root of the reproduction number and the diffusion coefficient. Epidemics have also previously been studied in random graphs and scale-free networks (Ajelli et al., 2010;Barrat et al., 2008). Previous work has also considered the correlation of viral spread with changing commuting patterns as well as signals from social media or search engines (Byambasuren, 2020;Li et al., 2020;Poletti et al., 2021;Subramanian and Pascual, 2021;Sun et al., 2021;Zhan et al., 2018); other approaches have integrated network effects into compartmental models (Ameri and Cooper, 2019;Barabá si, 2013;Ding et al., 2021;Dottori and Fabricius, 2015;Liu et al., 2018). However, they lack any symmetry-breaking mechanism for distinguishing between symptomatic and asymptomatic carriers. This is precisely what traSIR offers.
Even agent-based modeling does not directly resolve this issue. While agent-based modeling certainly provides enhanced resolution and a distribution of outcomes (certain aspects of which we leverage by decomposing our model to the county level), agent-based modeling has not been able to properly estimate asymptomatic spread (Kerr et al., 2021). One significant downside of agent-based modeling is that it is also computationally expensive, even when leveraging vectorized features to control spread mechanisms (Dabke and Arroyo, 2016).

Symmetry-breaking and asymptomatic spread
In order to properly estimate the asymptomatic rate, we need to be able to distinguish between symptomatic and asymptomatic spread. The model we present in this article allows us to distinguish between these two types of spread via different interaction patterns: we can have varying commuting and travel patterns among asymptomatic and symptomatic populations.
Our results show that under simple assumptions or static parameters, we can still distinguish the spread within these two populations. This implies that having some symmetrybreaking is paramount and while we have ample room for refinement, mere existence of symmetry-breaking is sufficient for estimating the asymptomatic rate accurately.
The asymptomatic rate can be estimated clinically (Johansson et al., 2021;Martinelli et al., 2022;Sah et al., 2021); while this is the most accurate approach, it requires large-scale surveillance testing, which can be prohibitively resource-intensive. In contrast, computational approaches tend to add compartments to traditional SIR models, which is also our strategy. In Li et al. (2021), they add multiple compartments; we add just one asymptomatic compartment, which simplifies our model and increases computational tractability. Our approach is most similar to Layton and Sadria (2022), but we generalize to a larger geography and incorporate commuter data. For any modeling task, especially with a novel infection, aggregating different modeling approaches and data sources usually yields a strong consensus estimate; therefore, we hope to add to the literature by providing a modeling refinement and an additional estimate of the asymptomatic rate of spread to the ensemble.

The model
We show how to embed the classic SIR epidemiological model (Kermack and McKendrick, 1927) within a geographic network with known travel rates. The network G ¼ ðV; EÞ is a directed graph joining N nodes (typically, counties), whose edges are annotated with the corresponding mean traffic rates of commuters. The edge set E includes all the pairs ði; jÞ such that residents of county i commute to work in county j. We assume the availability of an N-by-N stochastic "commute" matrix M, such that M ij indicates the probability that someone commutes from county i to county j on a typical workday.
On day t, we denote the number of susceptible and recovered individuals in county i by s i ðtÞ and r i ðtÞ, respectively. Among the f i ðtÞ carriers of the virus in the county, we distinguish between the c i ðtÞ of them who show symptoms and the a i ðtÞ ¼ f i ðtÞ À c i ðtÞ who do not. The population size in county i is denoted by n i ¼ f i ðtÞ þ s i ðtÞ þ r i ðtÞ and is assumed fixed over the period under investigation. For convenience, we may write the right-hand side as P x2ff ;s;rg x i ðtÞ. The commute matrix M is blind to the health status of commuters. Symptomatic people tend to travel less, however, and this change has great effect on contagion. To capture this phenomenon, we introduce the decommute rate d 2 ½0; 1 as a measure of the propensity of people feeling sick to stay home: (1) where I represents the identity matrix. Note that, if d ¼ 0, being symptomatic has no bearing on commuting. The matrix M c is a symmetry-breaking device which allows to distinguish between sick virus carriers and the rest. This difference creates observable nonlinearities in the viral dynamics that we can exploit to estimate the asymptomatic rate q. While our model could allow M x to be a function of time, we simply require M c to be different from the other transition matrices in order to get the benefit of symmetry-breaking. Therefore, we let all of them be static and set In the results reported here, we set d to 8=9.

The chronology of infection
Instead of stating the model all at once, we introduce it one piece at a time, following its natural chronology. We fix a county i and trace the changes in the main state variables s, c, a, r, f, n. We use specific times for illustrative purposes only. • Step 1: At 8 am on day t, all commuters are ready to go to work. We have f i ðtÞ ¼ c i ðtÞ þ a i ðtÞ and P x2fs;c;a;rg x i ðtÞ ¼ n i ðtÞ ¼ n i . • Step 2: At 9 am, commuters are at work. This changes the local population into a transient one, which we denote with a "hat." By definition of the commute matrix, x i ðtÞ ¼ P j M x ji x j ðtÞ for x ¼ s; c; a; r, withf i ðtÞ ¼ P x2fc;agx j ðtÞ andn i ðtÞ ¼ P x2fs;f ;rgx j ðtÞ. The transient population at county i will now get to mix all day at work and spread the infection among itself. • Step 3: At 5 pm, commuters go home. The new population at county i is denoted with a "bar." It consists of the same n i people present at 8 am, but with a different health status distribution. Take the set of infected individuals: it includes the f i ðtÞ carriers from 8 am plus the newly infected. The latter consist of the subset of the s i ðtÞ susceptible individuals who caught the virus by commuting to county j and got exposed to a carrier in the transient population of j. Note that this includes the case j ¼ i of noncommuters who were exposed to infected visitors. The chance of anyone getting sick in this fashion is u j ðtÞ :¼ bf j ðtÞ=n j ðtÞ, where 0 < b < 1 measures the transmission rate: it is the average number of contacts per person per day times the probability of transmission in a contact between an infected person and a susceptible one.
(The model can easily accommodate a time-varying rate b. The reason for keeping it fixed is to decouple the baseline socialization rate from its pandemic-induced variations via the commute matrices.) The number of newly infected residents of county i is the sum, over all j, of the number of commuters from county i who went to county j and got infected there: therefore, it is equal to s i ðtÞw i ðtÞ, where w i ðtÞ :¼ P j M ij u j ðtÞ < 1 denotes the worktime infectivity rate: it is the probability that a commuter from i catches the virus at work.
We have f i ðtÞ ¼ f i ðtÞ þ s i ðtÞw i ðtÞ. Since a fraction q of these new infections are asymptomatic, we have s i ðtÞ ¼ s i ðtÞð1 À w i ðtÞÞ c i ðtÞ ¼ c i ðtÞ þ ð1 À qÞs i ðtÞw i ðtÞ a i ðtÞ ¼ a i ðtÞ þ qs i ðtÞw i ðtÞ: 8 < : (2) • Step 4: At 8 am on day t þ 1, further mixing will have occurred in county i since the previous evening. A fraction c of the infected people will have recovered by then. Writing we have We note that traSIR involves two rounds of mixing: the first one in the daytime accounts for intercounty infection (via commuting); the second one (nighttime) models intracounty infection (within each county). For simplicity, we model recovery in the latter only. (For this reason, our value of c might differ from the standard one by a factor of 2.)

Parameter estimation
Given a commute network and daily symptomatic infections across each node in the network, we develop an approach for estimating the asymptomatic rate q. The estimation algorithm can be viewed as a two-player game in which participants take turns updating their current estimates of ðb; cÞ and q, respectively. Recall that b; c measure the infection and recovery rate, respectively. We assume that all counties have the same value of b and c. The updating is driven by grid search (and gradient descent) with respect to a normalized mean-square loss function, which is computed for a node k across all time points as follows: where c is the vector in ½0; 1 T whose coordinate cðtÞ denotes the recorded rate of symptomatic cases in the population in some given county k at time t. We write c ¼ c k when disambiguation is needed. The normalization makes the loss invariant under scaling. This is a necessary feature given the noise in the data. Of highest concern is the corruption of the official figures caused by the inclusion of reported asymptomatic cases via testing and the exclusion of symptomatic patients who do not seek a diagnosis. We assume that the signal-to-noise ratio remains constant over time; hence, that the time series c is available up to an unknown scaling factor. The normalization factors out that uncertainty.
The vectorĉ ¼ĉðb; c; qÞ is the traSIR-predicted counterpart to the factual vector c; the matrix M and the decommute rate, defined in (1), are fixed. We assume that the infection is seeded at county i 0 . With q expected to exert a relatively minor influence on the transmission/recovery parameters at the seeded node, it is natural to base the estimate of ðb; cÞ on the time series c i0 .Within Algorithm 1, we set j max ¼ 3 (convergence is quick). The grid search is over a discrete space of size 10 3 for q and 10 4 for ðb; cÞ. The number of gradient descent steps is k max ¼ 10 3 ; the gradient descent threshold is s min ¼ 10 À12 and the learning rate is e ¼ 10 À4 =NT. Note that the output ðb Ã ; c Ã ; q Ã Þ will be referred to as the estimated parameters ðb;ĉ;qÞ in the text.

Actual data
For the commute network and population data, we rely on the most recent (pre-COVID) American Community Survey from the U.S. Census Bureau (American Community Survey [ACS], 2015a,b). The nodes in the network represent the counties; the edges are directed and weighted in proportion to the number of residents who live in the source county and work in the destination county. We clean up the data by removing all the edges associated with fewer than 10 000 commuters. From the resulting graph, we extract the largest weakly connected component, which in this case corresponds to the New York Metropolitan Area. It consists of 44 counties: a visualization of which can be seen in Figure 1. For the infection data, we use the New York Times COVID-19 tracker and focus on the 200 days between March 1, 2020 and September 17, 2020 (Connell, 2022; The COVID Tracking Project at The Atlantic, 2022; The New York Times, 2022) (https://www.census.gov/data/tables/2015/ demo/metro-micro/commuting-flows-2015.html and https:// www.census.gov/programs-surveys/acs/technical-documenta tion/table-and-geography-changes/2015/5-year.html).

Simulated data
We generate 481 low-discrepancy values of b, c and q, where 0:2 b; q 0:8; 0:01 c 0:7, using Sobol sequences from the SciPy package (https://docs.scipy.org/doc/scipy/refer ence/generated/scipy.stats.qmc.Sobol.html). For each of the 481 combinations of parameters, we run traSIR with the corresponding parameters for 150 timesteps on the New York Metropolitan area population and network data, assuming that there is a single infected individual in New York County (Manhattan). We further corrupt the resulting symptomatic population sizes by a fixed scalar that is unknown to the algorithm.
For validation, we run Algorithm 1 on the corrupted simulation to produce the estimated parameters ðb;ĉ;qÞ. We evaluate the accuracy and tabulate the residuals between the estimation and actual parameters ðb; c; qÞ.

Results
The main contribution of this article is to demonstrate empirically that a network-based epidemiological model can uncover key parameters of a contagious disease. We provided an intuitive explanation for why this might be possible as long as a symmetry-breaking mechanism is in place for distinguishing among different types of virus carriers. Before we discuss the procedure ESTIMATE(c) q Ã 0:5 for ' ¼ 1; 2; . . . ; j max do . use grid search to optimize ðb Ã ; c Ã Þ via normalized mean-square loss function at initial county i 0 ðb Ã ; c Ã Þ argmin ðb;cÞ Lðc i0 ;ĉ i0 ðb; c; q Ã ÞÞ . use grid search to optimize q Ã via normalized meansquare loss function across all counties q Ã argmin q P N i¼1 Lðc i ;ĉ i ðb Ã ; c Ã ; qÞÞ . gradient descent on q Ã gðxÞ : return ðb Ã ; c Ã ; q Ã Þ empirical evidence and validate our approach, we provide a succinct mathematical foundation for our claim.

Theoretical analysis
We fix the county i and the time t and we drop all mention of t when it is understood from the context. By Equations (2) and (3), where Recall thatf i ðtÞ denotes the number of infected individuals in the transient population at county i at the end of the morning commute. Let f 0 i ¼ P k f k M ki be the number it would have been if we had d ¼ 0 and hence M c ¼ M; we derive n 0 i ¼ P k n k M ki fromn i ðtÞ likewise. We havê where g j ¼ f 0 j À f j . This allows us to rewrite w i as The worktime infectivity rate w i plays a key role in traSIR. If M ¼ I, then w i ¼ bf i =n i is the usual infectivity rate in the classic SIR model. Take the case of an arbitrary matrix M and set d ¼ 0. Denote by E jðiÞ the expectation operator indexed by i and defined by M ij , for j ¼ 1; . . . ; N. Likewise, we introduce the expectation operator E kðjÞ , indexed by j and defined by n k M kj = P l n l M lj , for k ¼ 1; . . . ; N. It follows that We conclude that, when decommuting is withheld (d ¼ 0), w i is an average of infection ratios f k =n k over counties adjacent to i or adjacent to the latter. This two-degree of separation corresponds to individuals from distinct counties meeting at work in a third county. The same idea holds for d > 0, but with corrective terms that we discuss below.
In epidemiology, an important characteristic of an infection is the basic reproduction number R 0 , which measures the average number of cases generated by an infected individual (Anderson and May, 1991). At the outset of the pandemic, we can use f i ðt þ 1Þ=f i ðtÞ as a proxy for the reproduction number R 0 associated with county i. In a classical SIR model, R 0 has the form b=c, but in TraSIR, it follows from (8) that Together, Equations (10) and (12) form a system SðqÞ ¼ 0, which in theory allows us to recover the asymptomatic rate q from b, c and R 0 . It is noteworthy that this requires decommuting. The system S cannot be solved for q in closed form. Using traSIR for estimation can thus be viewed as a numerical solver for S.

Simulations
We demonstrate that Algorithm 1 can accurately recover the infection rate b, recovery rate c and asymptomatic rate q in simulated infections across a wide range of parameters, using just knowledge about the network and the numbers of symptomatic infected individuals. For each of simulations resulting from many combinations of parameters (see Methods), we will use the number of symptomatic individuals for each county over time. In practice, the actual number of symptomatic individuals is larger than the number reported, we multiply each of the resulting symptomatic population sizes by a fixed scalar (unknown to the algorithm), and then run Algorithm 1 to produce the estimated parameters ðb;ĉ;qÞ.
We find excellent agreement between the actual parameters b, c and q and their estimates ðb;ĉ;qÞ (Fig. 2). Figure 2 shows a scatter plot of an estimated parameter against the corresponding synthetic parameter for the New York area. The Pearson correlation coefficient is 0.9996 between b and its predicted value. For c and q, it is 0.9983 and 0.9915, respectively. The absolute residual across all starting parameters has mean 0.023 and standard deviation 0.017. The absolute residual mean and standard deviation for b are 0.0032 and 0.00246; for c are 0.0065 and 0.0064 and for q are 0.0158 and 0.0163.

Applications to COVID-19 data
Having validated our estimation technique on simulated data, we now apply Algorithm 1 to daily infection numbers from the New York Metropolitan area (see Methods), and estimate the asymptomatic rate q, a parameter of critical importance to health policymakers. We find: b ¼ 0:320 ; c ¼ 0:046 ; q ¼ 0:345 : Using these parameters, we also compared the traSIRsimulated symptomatic infection count with the real reported infection count across the New York metropolitan area (Fig. 3), and find good agreement.

Discussion
We have shown via theoretical analysis and simulation that a network-augmented compartmental model can effectively estimate the asymptomatic rate of viral infections using only data about symptomatic infections. The theoretical estimation is derived through (10) and (12). For simulation, we have applied this approach to actual COVID-19 data and derived an estimate of the asymptomatic rate that matches well with the latest estimates obtained via extensive random testing.
While our results are based on the different interaction patterns among asymptomatic and symptomatic populations, it is also possible to distinguish between these two types of spread in two more ways: 1) Distinct infection rates: we can impose dissimilar transmissibility rates in the two types of spread. 2) Differentiated seeding: we can account for several types of spread starting in various locations, perhaps based on local policy or other environmental factors.
Taken together, these factors are a broad set of levers that can influence the emergent behavior of our model, thus potentially enhancing the accuracy of our traSIR model.
Our estimates for b and c are sharper than for q. This is no surprise. Both the transmission rate and the recovery rate have direct influence on the local shape of the infection time series: the first one has a large effect on the ascent phase of the contagion while the other one's impact can be felt most acutely in the descent phase. The impact of the asymptomatic rate q is more global and subtle. It can be felt in the speed of the traveling waves and generally operates on longer time scales. TraSIR is able to leverage such information. Credit for our success must also go to sheer luck: An asymptomatic rate of $30% is almost ideally sized for estimation. As we observed earlier, a rate close to 100% would make the task hopelessly difficult. This leaves open the possibility that other nonlinearities in the system can be exploited to boost accuracy when needed. While fast-changing health policy measures and medical breakthroughs (e.g. vaccination) can present traSIR with major challenges, they also create new windows of opportunity for novel estimation mechanisms. We hope that this work will plant the seeds for exciting new research on the messy, difficult, but fascinating subject of uncovering hidden epidemiological parameters.