Inferring networks from time series: A neural approach

Abstract Network structures underlie the dynamics of many complex phenomena, from gene regulation and food webs to power grids and social media. Yet, as they often cannot be observed directly, their connectivities must be inferred from observations of the dynamics to which they give rise. In this work, we present a powerful computational method to infer large network adjacency matrices from time series data using a neural network, providing uncertainty quantification on the prediction in a manner that reflects both the degree to which the inference problem is underdetermined and the noise on the data, a feature that other approaches have hitherto lacked. We demonstrate our method's capabilities by inferring line failure locations in the British power grid from its response to a power cut, providing probability densities on each edge and allowing the use of hypothesis testing to make meaningful probabilistic statements about the location of the cut. Our method is significantly more accurate than both Markov-chain Monte Carlo sampling and least squares regression on noisy data and when the problem is underdetermined, while naturally extending to the case of nonlinear dynamics, which we demonstrate by learning an entire cost matrix for a nonlinear model of economic activity in Greater London. Not having been specifically engineered for network inference, this method in fact represents a general parameter estimation scheme that is applicable to any high-dimensional parameter space.


Introduction
Networks are important objects of study across the scientific disciplines. They materialise as physical connections in the natural world, for instance as the mycorrhizal connections between fungi and root networks that transport nutrients and warning signals between plants [1,2], human traffic networks [3,4], or electricity grids [5,6]. However, they also appear as abstract, non-physical entities, such as when describing biological interaction networks and food webs [7][8][9], gene or protein networks [10][11][12][13], economic cost relations [14,15], or social links between people along which information (and misinformation) can flow [16][17][18]. In all these examples, though the links constituting the network may not be tangible, the mathematical description is the same. In this work, we are concerned with inferring the structure of a static network from observations of dynamics on it. The problem is of great scientific bearing: for instance, one may wish to understand the topology of an online social network from observing how information is passed through it, and some work has been done on this question [19][20][21]. Another important application is inferring the connectivity of neurons in the brain by observing their responses to external stimuli [22,23]. In an entirely different setting, networks crop up in statistics in the form of conditional independence graphs, describing dependencies between different variables, which again are to be inferred from data [24,25].
Our approach allows inferring network connectivities from time series data with uncertainty quantification. Uncertainty quantification for network inference is important for two reasons: first, the observations will often be noisy, and one would like the uncertainty on the data to translate to uncertainty on the predicted network. Second, completely inferring large networks requires equally large amounts of data (typically at least N − 1 equations per node, N being the number of nodes), and these observations must furthermore be linearly independent. Data of such quality and quantity will often not be available, leading to an underdetermined inference problem. The uncertainty on the predicted network should thus also reflect (at least to a certain degree) the 'non-convexity' of the loss function, i.e. how many networks are compatible with the observed data. To the best of our knowledge, no current network inference method is able to provide this information.
Network inference can be performed using ordinary least squares (OLS) regression [6,26], but this is confined to the case where the dynamics are linear in the adjacency matrix. Alternatives are sampling-based methods that generalise to the non-linear case [27][28][29], but these tend to struggle in very high-dimensional settings and can be computationally expensive. Efficient inference methods for large networks have been developed for cascading dynamics [19][20][21], but these are highly specialised to a particular type of observation data and give no uncertainty quantification on the network prediction. Our method avoids these limitations. Its use of neural networks is motivated by their recent and successful application to low-dimensional parameter calibration problems [30,31], both on synthetic and real data, as well as by their conceptual proximity to Bayesian inference, e.g. through neural network Gaussian processes or Bayesian neural networks [32][33][34][35][36][37]. Our method's underlying approach ties into this connection, and in fact, since it has not been specifically engineered to fit the network case, constitutes a general and versatile parameter estimation method.

Method description
We apply the method proposed in [31] to the network case. The approach consists of training a neural network to find a graph adjacency matrix Â ∈ R^{N×N} that, when inserted into the model equations, reproduces the observed time series T = (x_1, ..., x_L). A neural network is a function u_θ : R^{N×q} → R^p, where q ≥ 1 is the number of time series steps passed as input. Its output is the (vectorised) estimated adjacency matrix Â, which is used to run a numerical solver for B iterations (B is the batch size) to produce an estimated time series T̂(Â) = (x̂_i, ..., x̂_{i+B}). This in turn is used to train the internal parameters θ of the neural net (the weights and biases) via a loss function J(Â, T). The likelihood of any sampled estimate is

p(T | Â) ∝ exp(−J(Â, T)),

and by Bayes' rule, the posterior density is then

π(Â | T) ∝ exp(−J(Â, T)) π_0(Â),

with π_0 the prior density [38]. As Â = Â(θ), we may calculate the gradient ∇_θ J and use it to optimise the internal parameters of the neural net using a backpropagation method of choice; popular choices include stochastic gradient descent, Nesterov schemes, or the Adam optimizer [39]. Calculating ∇_θ J thus requires differentiating the predicted time series T̂, and thereby the system equations, with respect to Â. In other words: the loss function contains knowledge of the dynamics of the model. Finally, the true data is once again input to the neural net to produce a new parameter estimate Â, and the cycle starts afresh. A single pass over the entire dataset is called an epoch. Using a neural net allows us to exploit the fact that, as the net trains, it traverses the parameter space, calculating a loss at each point. Unlike Monte Carlo sampling, the posterior density is not constructed from the frequency with which each point is sampled, but rather calculated directly from the loss value at each sample point. This entirely eliminates the need for rejection sampling or a burn-in: at each point, the true value of the likelihood is obtained, and sampling a single point multiple
times provides no additional information, leading to a significant improvement in computational speed. Since the stochastic sampling process is entirely gradient-driven, the regions of high probability are typically found much more rapidly than with a random sampler, leading to a high sample density around the modes of the target distribution. We thus track the neural network's path through the parameter space and gather the loss values along the way. Multiple training runs can be performed in parallel, and each chain terminated once it reaches a stable minimum, increasing the sampling density on the domain and ensuring convergence to the posterior distribution in the limit of infinitely many chains.
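The train-and-collect idea above can be sketched in a deliberately simplified toy setting: the neural network is replaced by a directly optimised parameter matrix, the model equations by a linear one-step map, and the gradient is computed analytically. All sizes, the dynamics, and the learning rate are illustrative assumptions, not the paper's actual setup; the point is only that every visited parameter is stored together with its likelihood value exp(−J).

```python
import numpy as np

# Toy sketch (assumed setup): linear dynamics x_{t+1} = A x_t stand in for the
# model equations, and the adjacency estimate A_hat is optimised directly by
# gradient descent. Each visited point is recorded with exp(-J), the
# (unnormalised) likelihood from which the posterior is later assembled.

rng = np.random.default_rng(0)
N, L = 4, 30
mask = rng.random((N, N)) < 0.4
A_true = 0.5 * rng.random((N, N)) * mask          # sparse, contractive toy network
np.fill_diagonal(A_true, 0.0)

# observed time series T = (x_1, ..., x_L)
x = rng.random(N)
T = [x]
for _ in range(L - 1):
    x = A_true @ x
    T.append(x)
T = np.array(T)

A_hat = rng.random((N, N))                        # stands in for the net's output
lr, samples = 0.05, []
for step in range(2000):
    resid = (A_hat @ T[:-1].T).T - T[1:]          # data-model mismatch
    J = np.sum(resid ** 2)
    samples.append((A_hat.copy(), np.exp(-J)))    # (parameter, likelihood) pair
    A_hat = A_hat - lr * 2 * resid.T @ T[:-1]     # analytic gradient of J

assert samples[-1][1] > samples[0][1]             # likelihood improves while training
```

The list `samples` plays the role of the tracked path through parameter space: no rejection step or burn-in is needed, since the exact likelihood is available at every visited point.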
We begin this article with two application studies: first, we infer locations of a line failure in the British power grid from observations of the network response to the cut; and secondly, we infer economic cost relations between retail centres in Greater London. Thereafter we conduct a comparative analysis of our method's performance, before finally demonstrating the connection between the uncertainty on the neural net prediction and the uncertainty of the inference problem.

Inferring line failures in the British power grid
Power grids can be modelled as networks of coupled oscillators using the Kuramoto model of synchronised oscillation [40][41][42][43][44]. Each node i in the network either produces or consumes electrical power P_i while oscillating at the grid reference frequency Ω. The nodes are connected through a weighted undirected network A = (a_ij), where the link weights a_ij ∼ Y_ij U_ij² are obtained from the electrical admittances Y_ij and the voltages U_ij of the lines. The network coupling allows the phases φ_i(t) of the nodes to synchronise according to the differential equation [43]

α φ̈_i + β φ̇_i = P_i + κ Σ_j a_ij sin(φ_j − φ_i),    [3]

where α, β, and κ are the inertia, friction, and coupling coefficients respectively. A requirement for dynamical stability of the grid is that Σ_i P_i = 0, i.e. that as much power is put into the grid as is taken out through consumption and energy dissipation [42].
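A minimal numerical integration of these dynamics can illustrate the model's behaviour. The network, power values, and coefficients below are illustrative assumptions (not the British grid data), and a simple semi-implicit Euler scheme is used in place of the paper's solver:

```python
import numpy as np

# Semi-implicit Euler integration of the second-order Kuramoto power-grid
# model, alpha*ddphi_i + beta*dphi_i = P_i + kappa*sum_j a_ij sin(phi_j - phi_i),
# in the frame co-rotating at the reference frequency. Toy values throughout.

rng = np.random.default_rng(1)
N = 10
A = rng.random((N, N)) * (rng.random((N, N)) < 0.3)
A = np.triu(A, 1) + np.triu(A, 1).T          # undirected weights, zero diagonal
P = rng.uniform(-1.0, 1.0, N)
P -= P.mean()                                 # enforce sum_i P_i = 0
alpha, beta, kappa, dt = 1.0, 2.0, 5.0, 0.01

phi = rng.uniform(0, 2 * np.pi, N)
dphi = np.zeros(N)
for _ in range(5000):
    coupling = (A * np.sin(phi[None, :] - phi[:, None])).sum(axis=1)
    ddphi = (P + kappa * coupling - beta * dphi) / alpha
    dphi += dt * ddphi
    phi += dt * dphi

# Since sum_i P_i = 0 and the coupling terms cancel pairwise (A symmetric,
# sin antisymmetric), the mean frequency deviation stays at zero:
assert abs(dphi.mean()) < 1e-6
```

The pairwise cancellation of the coupling terms is the discrete counterpart of the power-balance condition Σ_i P_i = 0 stated above.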
A power line failure causes the network to redistribute the power loads, causing an adjustment cascade to ripple through the network until equilibrium is restored [5]. In this work we recover the location of a line failure in the British power grid from observing these response dynamics. Figure 1a shows the high-voltage transmission grid of Great Britain as of January 2023, totalling 630 nodes (representing power stations, substations, and transformers) and 763 edges with their operating voltages. Of the roughly 1300 power stations dotted around the island, we include those 38 with installed capacities of at least 400 MW that are directly connected to the national grid [45]; following [5,42] we give all other nodes a random value P_i ∼ U[−200, +200] such that Σ_i P_i = 0.
We simulate a power cut in the northeast of England by iterating the Kuramoto dynamics until the system reaches a steady state of equilibrium (defined as |φ̈_i/φ̇_i| ≤ 0.01 ∀i) and then removing two links and recording the network response (fig. 1b). From the response we can infer the adjacency matrix of the perturbed network Ã (with missing links) and, by comparing with the unperturbed network A₀ (without missing links), the line failure locations. We let a neural network output a (vectorised) adjacency matrix Â and use this estimated adjacency matrix to run the differential equation [3], which will produce an estimate T̂ of the observed time series of phases T. A hyperparameter sweep on synthetic data showed that using a deep neural network with 5 layers, 20 nodes per layer, and no bias yields optimal results (see figs. S2-S4 in the appendix). We use the hyperbolic tangent as an activation function on each layer except the last, where we use the 'hard sigmoid' [49,50], which allows neural net output components to actually become zero, and not just asymptotically close, thereby ensuring sparsity of the adjacency matrix (a reasonable assumption, given that the power grid is far from fully connected). We use the Adam optimizer [39] with a learning rate of 0.002 for the gradient descent step. Since the neural network outputs are in [0, 1], we scale the network weights a_ij → λa_ij such that a_ij ∈ [0, 1], and absorb the scaling constant λ into the coupling constant κ; see the Supplementary Information for details on the calculations.
We use the following loss function to train the internal weights θ of the neural network such that it will output an adjacency matrix that reproduces the observed data:

J(Â, T) = ∥T̂(Â) − T∥₂² + ∥Â − Âᵀ∥₂² + ∥diag(Â)∥₂² + ν ∥Â − A₀∥₂².

The first summand is the data-model mismatch, the second penalises asymmetry to enforce undirectedness of the network, and the third sets the diagonal to zero (which cannot be inferred from the data, since all terms sin(φ_j − φ_i) = 0 for i = j). ν = ν(s) is a function of the iteration count s designed to let the neural network search for Ã in the vicinity of A₀, since we can assume a priori that the two will be similar in most entries. To this end we set ν = 10 while the loss function has not yet reached a stable minimum, quantified by |⟨∂_s J⟩| > 10⁻¹⁰ and |⟨∂_ss J⟩| > 10⁻¹⁰, and ν = 0 thereafter. Here, ⟨•⟩ is a rolling average over a window of 20 iterations, see fig.
2. In other words, we push the neural network towards a stable minimum in the neighbourhood of A₀ and, once the loss stabilises, permanently set ν = 0.
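The composite loss and the ν-schedule can be sketched as follows. The exact functional form is an assumption reconstructed from the description above (data-model mismatch, asymmetry penalty, diagonal penalty, proximity term to A₀), and the schedule here is a simplified, stateless variant of the permanent switch-off:

```python
import numpy as np

def loss(A_hat, T_pred, T_obs, A0, nu):
    """Assumed form of the composite loss: mismatch + asymmetry + diagonal
    penalties, plus a nu-weighted proximity term to the unperturbed network."""
    mismatch = np.sum((T_pred - T_obs) ** 2)
    asymmetry = np.sum((A_hat - A_hat.T) ** 2)
    diagonal = np.sum(np.diag(A_hat) ** 2)
    proximity = nu * np.sum((A_hat - A0) ** 2)
    return mismatch + asymmetry + diagonal + proximity

def nu_schedule(loss_history, window=20, tol=1e-10):
    """nu = 10 until the rolling first and second differences of the loss
    fall below tol, and 0 thereafter (simplified: not made permanent here)."""
    if len(loss_history) < window + 2:
        return 10.0
    h = np.asarray(loss_history)
    d1 = np.convolve(np.diff(h), np.ones(window) / window, mode="valid")
    d2 = np.convolve(np.diff(h, 2), np.ones(window) / window, mode="valid")
    return 0.0 if (abs(d1[-1]) < tol and abs(d2[-1]) < tol) else 10.0

# A perfectly symmetric, zero-diagonal estimate that matches the data gives
# zero loss when it also coincides with A0:
A0 = np.array([[0.0, 1.0], [1.0, 0.0]])
T = np.zeros(3)
assert loss(A0, T, T, A0, 10.0) == 0.0
```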
In theory L = N − 1 observations are needed to completely infer the network, though symmetries in the data usually mean L > N is required in practice [51]. In this experiment we purposefully underdetermine the problem by only using L < N − 1 steps; additionally, we train the network on data recorded 1 simulated second after the power cut, where many nodes will still be close to equilibrium. Though the neural network may be unable to completely infer the network, it can nevertheless produce a joint distribution on the network edge weights p(Â | T), recorded during the training, that allows us to perform hypothesis testing on the line failure location. The marginal likelihood on each network edge â_ij is given by

ρ(â_ij | T) ∝ ∫ exp(−J) π₀(Â) dÂ₋ᵢⱼ,

where the −ij subscript indicates we are omitting the ij-th component of Â in the integration. We assume uniform priors π₀ on each edge. In high dimensions, calculating the joint of all network edge weights can become computationally infeasible, but we can circumvent this by instead considering the two-dimensional joint density of the edge weight under consideration and the likelihood, p(â_ij, e⁻ᴶ), and then integrating over the likelihood dimension. We show the results in fig. 3.
Given the marginal distributions ρ(â_ij | T) with modes ã_ij, we plot the densities on the four network edges with the highest relative prediction error with respect to the unperturbed network A₀. The advantage of obtaining uncertainty quantification on the network is now immediately clear: even in the underdetermined case we are able to make meaningful statistical statements about the line failure location. We see that the missing edges consistently have the highest relative prediction errors, and that the p-values for measuring the unperturbed value a⁰_ij under the null â_ij are 0.2% and 0.04% respectively, while being statistically insignificant for all other edges. It is interesting to note that the other candidate locations are also within the vicinity of the line failure, though their predicted values are much closer to the unperturbed value. In fig. 3b, we see that the predicted network reproduces the response dynamics for the range covered by the training data when inserted into eq. [3], but, since the problem was purposefully underdetermined, the errors in the prediction Â cause the predicted and true time series to diverge for larger t. Densities on all 200,000 potential edges were obtained in about twenty minutes on a regular laptop CPU.
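The marginalisation trick above, building a one-dimensional edge density from the (parameter, likelihood) pairs collected during training, amounts to a likelihood-weighted histogram of the visited values of a single edge. A sketch with synthetic samples (the toy loss and all numbers are illustrative assumptions):

```python
import numpy as np

# Each visited value of one edge weight a_ij is histogrammed with weight
# exp(-J), giving an (unnormalised) marginal density without ever forming
# the full joint over all edges. Synthetic samples for illustration only.

rng = np.random.default_rng(2)
S = 5000
edge_samples = rng.normal(0.7, 0.1, S)       # visited values of one edge
J = 50 * (edge_samples - 0.7) ** 2           # toy loss, minimised at 0.7
weights = np.exp(-J)                         # likelihood at each sample

hist, bins = np.histogram(edge_samples, bins=50, weights=weights, density=True)
centres = 0.5 * (bins[:-1] + bins[1:])
mode = centres[np.argmax(hist)]

assert abs(mode - 0.7) < 0.05                # density peaks near the true mode
```

Because the likelihood is evaluated exactly at every sample, no rejection step is required; p-values as used above can then be read off the resulting density.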

Inferring economic cost networks from noisy data
In the previous example the underlying network was a physical entity, but in many cases networks model abstract connections. We therefore now consider a commonly used economic model of the coupling of supply and demand [14,15,55] and a dataset of economic activity across Greater London. The goal is to learn the entire coupling network, not just to infer the (non-)existence of individual edges. In the model, N origin zones of sizes O_i, representing economic demand, are coupled to M destination zones of sizes W_j, modelling the supply side, through a network whose weights quantify the convenience with which demand from zone i can be supplied from zone j: the higher the weight, the more demand flows through that edge (see fig. 4a). Such a model is applicable e.g. to an urban setting [14], the origin zones representing residential areas, the destination zones e.g. commercial centres, and the weights quantifying the connectivity between the two (transport times, distances, etc.). The resulting cumulative demand at destination zone j depends both on the current size W_j(t) of the destination zone and the network weights c_ij. The sizes W_j are governed by a system of M coupled logistic Stratonovich stochastic differential equations with given initial conditions W_j(0), driven by noise of strength σ ≥ 0, where the ξ_j are independent white noise processes and ∘ signifies Stratonovich integration [56]. Crucially, the model depends non-linearly on C.
We apply this model to a previously studied dataset of economic activity in Greater London [15,31]. We use the ward-level household income from N = 625 wards for 2015 [53] and the retail floor space of the M = 49 largest commercial centres in London [52] as the initial origin zone and destination zone sizes respectively, i.e. O(0) and W(0), and from this generate a synthetic time series using the parameters estimated in [31] for a high noise level of σ = 0.14. For the network C we use the Google Distance Matrix API to extract the shortest travel time d_ij between nodes, using either public transport or driving. The network weights are derived in [57] from these travel times, with the scale factor τ = max_{i,j} d_ij ensuring a unitless exponent.
We generate a synthetic time series of 10000 time steps, from which we subsample 2500 2-step windows, giving a total training set size of L = 5000 time steps. This is to ensure we sample a sufficiently broad spectrum of the system's dynamics, thereby fully determining the inference problem and isolating the effect of the training noise. A hyperparameter sweep on synthetic data showed that using a neural network with 2 layers, 20 nodes per layer, and no bias yields optimal results. We use the hyperbolic tangent as the activation function on all layers except the last, where we use the standard sigmoid function (since the network is complete, there is no need to use the hard sigmoid, as all edge weights are nonzero). To train the neural network we use the simple loss function J = ∥T̂(Â) − T∥₂², where T̂ and T are the predicted and true time series of destination zone sizes. Since the dynamics are invariant under scaling of the cost matrix C → λC, we normalise the row sums of the predicted and true networks. Figure 4c shows the inferred distribution ρ(k) of the (weighted) origin zone node degrees k_j = Σ_i c_ij. The solid line is the maximum likelihood prediction, and the dotted red line the true distribution. Even with a high level of noise, the model manages to accurately predict the underlying connectivity matrix, comprising over 30,000 weights, in under 5 minutes on a regular laptop CPU. Uncertainty on P(k) is given by the total standard deviation around the maximum likelihood estimator P̂ (eq. [8]). As we will discuss in the last section, this method meaningfully captures the uncertainty due to the noise in the data and the degree to which the problem is underdetermined.
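The row-sum normalisation that removes the scaling invariance C → λC, and the weighted degrees used for the comparison in fig. 4c, are both one-liners. A minimal sketch with a random cost matrix (sizes are illustrative):

```python
import numpy as np

# Remove the scaling invariance by normalising each origin's row of the cost
# matrix to sum to 1, then compute the weighted degrees k_j = sum_i c_ij.

rng = np.random.default_rng(3)
N, M = 6, 4
C = rng.random((N, M))

C_norm = C / C.sum(axis=1, keepdims=True)    # each row now sums to 1
k = C_norm.sum(axis=0)                       # weighted degree of each destination

assert np.allclose(C_norm.sum(axis=1), 1.0)
assert np.isclose(k.sum(), N)                # total weight is conserved
```

Normalising both the predicted and the true matrix in this way makes their degree distributions directly comparable.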

Comparative performance analysis
We now analyse our method's performance, both in terms of prediction quality and computational speed, by comparing it to a Markov-chain Monte Carlo (MCMC) approach as well as to a classical regression method, presented e.g. in [6,58]. As mentioned in the introduction, computationally efficient network learning methods have been developed for specific data structures; however, we compare our approach with MCMC and OLS since both are general in the types of data to which they are applicable.
Consider noisy Kuramoto dynamics

α φ̈_i + β φ̇_i = ω_i + Σ_j a_ij sin(φ_j − φ_i) + σ ξ_i,    [9]

with ξ_i independent white noise processes with strength σ, and ω_i the eigenfrequencies of the nodes. Given L observations of each node's dynamics, we can gather the left side into a single vector X_i ∈ R^{1×L} for each node, and obtain (up to the noise term) the N equations

X_i = A_i G_i,

with A_i ∈ R^{1×N} the i-th row of the adjacency matrix A, and G_i ∈ R^{N×L} the L observations of the interaction terms sin(φ_j − φ_i), j = 1, ..., N. From this we can then naturally estimate the i-th row of A using ordinary least squares:

Â_i = X_i G_iᵀ (G_i G_iᵀ)⁻¹.

Given sufficiently many linearly independent observations, the Gram matrices G_i G_iᵀ will all be invertible; in the underdetermined case, a pseudoinverse can be used to approximate their inverses. As before, the diagonal of Â is manually set to 0.
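The row-by-row OLS estimator can be sketched directly. The sizes are toy values and the data is noiseless first-order Kuramoto "observations" generated from a known network; the pseudoinverse handles the rank deficiency caused by the identically zero i-th row of G_i:

```python
import numpy as np

# OLS recovery of A one row at a time: X_i = A_i G_i, with G_i the observed
# interaction terms sin(phi_j - phi_i), and A_hat_i = X_i G_i^T (G_i G_i^T)^+.

rng = np.random.default_rng(4)
N, L = 5, 50
A = rng.random((N, N)) * (rng.random((N, N)) < 0.5)
np.fill_diagonal(A, 0.0)

phis = rng.uniform(0, 2 * np.pi, (L, N))       # snapshots of the phases
A_hat = np.zeros_like(A)
for i in range(N):
    G = np.sin(phis - phis[:, [i]]).T          # G_i in R^{N x L}; row i is zero
    X = A[i] @ G                               # noiseless "observations" X_i
    A_hat[i] = X @ G.T @ np.linalg.pinv(G @ G.T)
np.fill_diagonal(A_hat, 0.0)

assert np.allclose(A_hat, A, atol=1e-8)        # exact recovery in the noiseless case
```

With fewer than N − 1 linearly independent observations the same code returns the minimum-norm solution instead, which is where the accuracy degradation discussed below sets in.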
In addition to regression, we also compare our method to a preconditioned Metropolis-adjusted Langevin sampling scheme (MALA) [27][28][29][59], which constructs a Markov chain of sampled adjacency matrices Â by drawing proposals from a normal distribution centred on the current sample, shifted along the gradient of the log-posterior with preconditioned step size τP. Here, τ > 0 is the integration step size, P ∈ R^{N²×N²} a preconditioner (note that we are reshaping Â into an N²-dimensional vector), and λ = tr(P)/N² its average eigenvalue. Each proposal is accepted or rejected in a Metropolis-Hastings step using the corresponding transition probabilities. We tune τ so that the acceptance ratio η converges to the optimal value of 0.57 [60].
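A stripped-down (unpreconditioned, P = I) version of this baseline sampler can be written in a few lines. The Gaussian target, step size, and chain length are illustrative assumptions; the structure (gradient-drifted proposal plus Metropolis-Hastings correction) is the standard MALA scheme:

```python
import numpy as np

# Unpreconditioned MALA targeting a toy 2D standard-normal posterior:
# proposals drift along grad log pi, and the MH step corrects the discretisation.

rng = np.random.default_rng(5)

def log_pi(x):                       # log target density (up to a constant)
    return -0.5 * float(x @ x)

def grad_log_pi(x):
    return -x

def log_q(x_to, x_from, tau):        # log transition density of the proposal
    d = x_to - (x_from + tau * grad_log_pi(x_from))
    return -float(d @ d) / (4 * tau)

tau, x, accepted = 0.5, np.zeros(2), 0
samples = []
for _ in range(20000):
    prop = x + tau * grad_log_pi(x) + np.sqrt(2 * tau) * rng.normal(size=2)
    log_alpha = log_pi(prop) + log_q(x, prop, tau) - log_pi(x) - log_q(prop, x, tau)
    if np.log(rng.random()) < log_alpha:
        x, accepted = prop, accepted + 1
    samples.append(x.copy())
samples = np.array(samples)

assert abs(samples[5000:].mean()) < 0.1      # chain centres on the target mean
```

The preconditioned variant used in the comparison replaces the identity covariance by τP and rescales by the average eigenvalue λ, which is what makes the scheme viable in the N²-dimensional adjacency space.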
We set the preconditioner P to be the inverse Fisher information covariance matrix, which has been shown to optimise the expected squared jump distance [29]. The expectation value is calculated empirically over all samples drawn, using the efficient algorithm given in [29]. In all experiments, we employ a 'warm start' by initialising the sampler close to the minimum of the problem; we found this to be necessary in such high dimensions (between 256 and 490,000) to produce decent results. Unlike the MCMC sampler, the neural network is initialised randomly. Figures 5a-b show our method's prediction accuracy alongside that of OLS regression and preconditioned MALA on synthetic Kuramoto data; the accuracy here is defined as the L₁ error ∥Â − A∥₁ (eq. [16]), where Â is the mode of the posterior. In fig. 5a, the accuracy is shown as a function of the noise σ on the training data. We generate enough data to ensure the likelihood function is unimodal. For the practically noiseless case of σ < 10⁻⁵, the regression scheme on average outperforms the neural approach; however, already for low noise levels σ ≥ 10⁻⁵, the neural approach proves far more robust, outperforming OLS by up to one order of magnitude and maintaining its prediction performance for noise levels up to σ ≤ 10⁻³. Meanwhile, we find that in the low- to mid-level noise regime, the neural scheme approximates the mode of the distribution one to two orders of magnitude more accurately than the Langevin sampler. For high levels of noise (σ > 10⁻²), the performances of the neural and MALA schemes converge. These results hold both for first-order (α = 0) and second-order Kuramoto dynamics [3]; in the second-order case, the neural method begins outperforming OLS at even lower levels of σ than in the first-order case, though the improvement is not as significant (cf. fig. S6 in the appendix).
In figure 5b we show the accuracy as a function of the convexity of the loss function. In general, it is hard to quantify the convexity of J, since we do not know how many networks fit the equation at hand. However, when the dynamics are linear in the adjacency matrix A, we can do so using the Gram matrices of the observations of each node i, G_i G_iᵀ, quantifying the (non-)convexity of the problem by the minimum rank of all the Gram matrices,

c = min_i rank(G_i G_iᵀ).    [17]

The problem is fully determined if c = N − 1 and all Gram matrices are invertible. As shown, regression is again more accurate when the problem is close to fully determined; however, as c decreases, the accuracy quickly drops, with the neural scheme proving up to an order of magnitude more accurate. Meanwhile, the MCMC scheme is consistently outperformed by the neural scheme, though it too eclipses regression for c < 0.75. In summary, regression is only viable in the virtually noiseless and fully determined case, while the neural scheme maintains good prediction performance even in the noisy and highly underdetermined case (see also figs. 5d-e).
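The convexity measure c is straightforward to compute from the observed phases. A sketch (toy sizes; the i-th row of each G_i is identically zero, so the maximum attainable rank is N − 1):

```python
import numpy as np

# c = min_i rank(G_i G_i^T), the minimum Gram-matrix rank over all nodes.
# Fewer than N - 1 observations necessarily give an underdetermined problem.

def convexity(phis):
    """phis: (L, N) array of observed phases."""
    L, N = phis.shape
    ranks = []
    for i in range(N):
        G = np.sin(phis - phis[:, [i]]).T        # G_i in R^{N x L}
        ranks.append(np.linalg.matrix_rank(G @ G.T))
    return min(ranks)

rng = np.random.default_rng(6)
full = convexity(rng.uniform(0, 2 * np.pi, (50, 5)))   # ample data: c = N - 1
under = convexity(rng.uniform(0, 2 * np.pi, (2, 5)))   # two observations only

assert full == 4       # N - 1, since row i of G_i is identically zero
assert under <= 2      # rank is capped by the number of observations
```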
In figure 5c we show compute times to obtain 100 samples for both the neural and MALA schemes. The complexity of the neural scheme is O(n_E × LN²), with n_E the number of training epochs. This is because each epoch requires O(LN²) operations to evaluate the model equation (the vector-matrix multiplication in eq. [11]), and O(LN²/B) for the stochastic gradient descent update, where we are holding L/B constant to ensure comparability. As is visible, the average L₁ error per edge weight remains constant over N, showing that the number of epochs required to achieve a given node-averaged prediction accuracy is independent of N. The preconditioned MALA scheme is considerably slower, due to the computational cost of calculating the preconditioner and the Metropolis-Hastings rejection step.
Lastly, figures 5d-e show the estimated weighted degree and triangle distributions of a large graph with 1000 nodes, or 1 million edge weights to be estimated, for noisy training data. The number of weighted, undirected triangles on each node i is given by (1/2) Σ_{jk} a_ij a_jk a_ki. The model robustly finds the true adjacency matrix, and we again quantify uncertainty on the prediction using the standard deviation (eq. [8]). Estimating a network with 1000 nodes on a standard laptop CPU took about 1 hour, which reduces to 6 minutes when using a GPU. Most high-performance network inference techniques demonstrate their viability on graphs with at most this number of nodes, e.g. ConNIe [19] and NetINF [21]. In [19], the authors state that graphs with 1000 nodes can typically be inferred from cascade data in under 10 minutes on a standard laptop. Similarly, the authors of NetINF [21] state that it can infer a network with 1000 nodes in a matter of minutes, though this algorithm does not infer edge weights, only the existence of edges, and neither technique provides uncertainty quantification.

Figure 5: (a) The L₁ error (eq. [16]) of the neural scheme, the preconditioned Metropolis-adjusted Langevin sampler, and OLS regression as a function of the noise variance σ on the training data. For very high noise levels, the training data is essentially pure noise, and the prediction errors begin to plateau. First-order Kuramoto dynamics are used (α = 0), though these results also hold for second-order dynamics (cf. fig. S6 in the appendix). Enough data is used to ensure full invertibility of the Gram matrix (c = 1). (b) The L₁ accuracy as a function of the convexity c of the loss function (eq. [17]). (c) Compute times for ten epochs, or 100 samples, of the neural scheme and the preconditioned Metropolis-adjusted Langevin sampler (MALA), averaged over 10 runs; the shaded areas show one standard deviation. On the right axis, the average L₁ prediction error of the neural scheme (1/N)∥Â − A∥₁ after 10 epochs is shown, which remains fairly constant as a function of N, showing that the number of gradient descent steps required to achieve a given average prediction error does not depend on N.
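The per-node weighted triangle count used for fig. 5e, t_i = (1/2) Σ_{jk} a_ij a_jk a_ki, is half the diagonal of A³. A minimal check on a single unit-weight triangle:

```python
import numpy as np

def weighted_triangles(A):
    """Weighted, undirected triangle count per node: half the diagonal of A^3."""
    return 0.5 * np.diag(A @ A @ A)

# A single triangle on nodes {0, 1, 2} plus an isolated node 3:
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 0)]:
    A[u, v] = A[v, u] = 1.0

t = weighted_triangles(A)
assert np.allclose(t, [1.0, 1.0, 1.0, 0.0])
```

Each triangle node has two closed walks of length three (one per orientation), which the factor 1/2 collapses to a single triangle.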

Quantifying uncertainty
There are two sources of uncertainty when inferring adjacency matrices: the non-convexity of the loss function J, and the noise σ on the data. In figure 6a we show the expected Hellinger error (eq. [18]) on the predicted degree distribution as a function of c. As is visible, the error on the distribution decreases as c tends to its maximum value of N − 1. For c = N − 1, some residual uncertainty remains due to the uncertainty on the neural network parameters θ.

Figure 6: (a) The expected Hellinger error (eq. [18]) on the degree distribution P(k) as a function of c (eq. [17]) in the noiseless case. The error is normalised to the value at c = 0.21(N − 1). As c increases, the error on the prediction decreases almost linearly. We run the model from 10 different initialisations and average over each (shaded area: standard deviation). (b) and (c): Prediction uncertainty due to noise in the data. Shown are the expected Hellinger error (eq. [18]) and expected relative entropy (eq. [19]) to the maximum likelihood estimate, as well as the total standard deviation s (eq. [8]), for the degree distribution P(k) and triangle distribution P(t) as a function of the noise σ on the data. Each line is an average over 10 different initialisations. In all cases, training was conducted on synthetic, first-order Kuramoto data (eq. [9] with α = 0).
In figures 6b-c we show the expected Hellinger error (eq. [18]) on the maximum likelihood estimator P̂ as a function of σ, for both the degree and triangle distributions, i.e. x ∈ {k, t}. In addition, we also show the behaviour of the expected relative entropy (eq. [19]) and the total standard deviation s (eq. [8]). All three metrics reflect the noise on the training data, providing similarly behaved, meaningful uncertainty quantification. As the noise tends to 0, some residual uncertainty again remains, while for very high noise levels the uncertainty begins to plateau. Our method thus manages to capture the uncertainty arising from both sources: the non-convexity of J and the noise σ on the data.
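For discrete distributions on a common support, the two comparison metrics above reduce to a few lines each. The example distributions are illustrative:

```python
import numpy as np

# Hellinger distance and relative entropy (KL divergence) between a sampled
# distribution p and a maximum-likelihood estimate q on the same support.

def hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def relative_entropy(p, q):
    mask = p > 0                      # 0 * log(0) terms contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

assert hellinger(p, p) == 0.0
assert relative_entropy(p, p) == 0.0
assert 0.0 < hellinger(p, q) < 1.0    # Hellinger distance is bounded by 1
assert relative_entropy(p, q) > 0.0
```

The boundedness of the Hellinger distance is what makes it convenient for normalised comparisons such as fig. 6a.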

Discussion
In this work we have demonstrated a powerful method to estimate network adjacency matrices from time series data. We showed its effectiveness at correctly and reliably inferring networks in a variety of scenarios: convex and non-convex cases, low to high noise regimes, and equations that are both linear and non-linear in A. We were able to reliably infer power line failures in the national power grid of Great Britain, and the connectivity matrix of an economic system covering all of Greater London. We showed that our method is well able to handle inference of hundreds of thousands to a million edge weights, while simultaneously giving uncertainty quantification that meaningfully reflects both the non-convexity of the loss function as well as the noise on the training data. Our method is significantly more accurate than MCMC sampling, and outperforms OLS regression in all except the virtually noiseless and fully determined cases. This is an important improvement, since large amounts of data are typically required to ensure the network inference problem is fully determined, which may often not be available, as suggested in the power grid study. Unlike regression, our method also naturally extends to the case of non-linear dynamics. In conjunction with our previous work [31], we have now also demonstrated the viability of using neural networks for parameter calibration in both the low- and high-dimensional case. Our method is simple to implement as well as highly versatile, giving excellent results across a variety of problems. All experiments in this work were purposefully conducted on a standard laptop CPU, typically taking on the order of minutes to run. Many lines for future research open up from this work. Firstly, a thorough theoretical investigation of the method is warranted, establishing rigorous convergence guarantees and bounds on the error of the posterior estimate. Another direction is further reducing the amount of data required to accurately learn parameters, and
in future research the authors aim to address the question of learning system properties from observations of a single particle trajectory in the mean-field limit [61,62]. In this work we have also not considered the impact of the network topology on the prediction performance, focusing rather on the physical dynamics of the problem. An interesting question is to what degree different network structures themselves are amenable to or hinder the learning process.
Over the past decade much work has been conducted on graph neural architectures [63,64], the use of which may further expand the capabilities of our method. More specialised architectures may prove advantageous for different (and possibly more difficult) inference tasks, though we conducted a limited number of experiments with alternatives (e.g. autoencoders, cf. fig. S4) and were unable to find great performance improvements. Finally, one drawback of our proposed method in its current form is that it requires differentiability of the model equations in the parameters to be learned; future research might aim to develop a variational approach to extend our method to weakly differentiable settings.

Data, materials, and Software Availability

Code and synthetic data can be found under https://github.com/ThGaskin/NeuralABM. It is easily adaptable to new models and ideas. The code uses the utopya package [65,66] to handle simulation configuration and efficiently read, write, analyse, and evaluate data. This means that the model can be run by modifying simple and intuitive configuration files, without touching code. Multiple training runs and parameter sweeps are automatically parallelised. The neural core is implemented using pytorch. All synthetic datasets as well as the London dataset have been made available, together with the configuration files needed to reproduce the plots. Detailed instructions are provided in the supplementary material and the repository. The British power grid data [46][47][48] is property of the respective organisations and cannot be made available without permission; however, as of early 2023 it is freely available from those organisations upon request. The code used to run the experiments is available in the repository.

Neural networks: notation and terminology
A neural network is a sequence of L ≥ 1 concatenated transformations. The i-th layer of the net consists of L_i neurons, connected through a sequence of weight matrices W_i ∈ R^(L_{i+1} × L_i). Each layer applies the transformation x ↦ σ_i(W_i x + b_i) to the input x from the previous layer, where b_i ∈ R^(L_{i+1}) is the bias of the i-th layer. The function σ_i : R^(L_{i+1}) → R^(L_{i+1}) is the activation function; popular choices include the rectified linear unit (ReLU) activation function σ(x) = max(x, 0) and the sigmoid activation function σ(x) = (1 + e^(−x))^(−1). A neural net has an input layer, an output layer, and hidden layers, which are the layers between the input and output layers. If a network has only one hidden layer, we call it shallow; otherwise we call the neural net deep.
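As a concrete illustration, the layer map x ↦ σ_i(W_i x + b_i) can be sketched in a few lines of numpy; the layer sizes and weight initialisation below are arbitrary, not those used in the paper.

```python
import numpy as np

def sigmoid(x):
    # Sigmoid activation: maps inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Rectified linear unit: max(x, 0)
    return np.maximum(x, 0.0)

def forward(x, weights, biases, activations):
    """Apply each layer's map x -> sigma_i(W_i x + b_i) in sequence."""
    for W, b, sigma in zip(weights, biases, activations):
        x = sigma(W @ x + b)
    return x

# A shallow net (one hidden layer): 3 inputs -> 4 hidden neurons -> 2 outputs
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
out = forward(np.ones(3), weights, biases, [relu, sigmoid])
```

Since the final activation is a sigmoid, the outputs lie in (0, 1), matching the output range assumed for the edge weights later in the text.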

Constructing the admittance matrix
Here we provide some additional information on the calculation of the network edge weights from the data. The weights quantify the admittance (inverse impedance) of a line, that is, how easily the line can transmit electrical current. The characteristic impedance of a transmission line is given by

Z_0 = √((R + iΩL)/(iΩC)),

with i the complex unit, Ω = 50 Hz the grid frequency [67], R the AC resistance, L the cable inductance, and C its capacitance. We use the following values for a copper conductor with a cross-section of 1000 mm², provided by a standard manufacturer of power grid cables: R = 0.0276 Ω/km (AC resistance), L = 0.41 mH/km, C = 0.150 µF/km (assuming a single core). The total impedance along the line is then given by

Z = Z_0 sinh(γl),

where l is the length of the line, and the propagation constant γ is given by

γ = √((R + iΩL) iΩC).

In these calculations we neglect the conductance of the dielectric along the line, which is negligible at the distances present in the network. The admittance of the line is then given by

Y = Z^(−1),

and the total edge weight a_ij by a_ij = |Y_ij|. Long lines typically carry two to four times as many cables as shorter lines. We account for this by multiplying the admittance of lines over 10 km by a factor of 2, and those over 80 km by a factor of 3 (admittances are additive).
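A minimal sketch of this calculation follows; the formulas are the standard telegrapher's-equation expressions quoted above, and the handling of the length thresholds reflects our reading of the text.

```python
import numpy as np

# Line parameters from the text, for a 1000 mm^2 copper conductor (per km)
R = 0.0276        # AC resistance [Ohm/km]
L = 0.41e-3       # inductance [H/km]
C = 0.150e-6      # capacitance [F/km]
Omega = 50.0      # grid frequency [Hz], as used in the text

def line_admittance(l):
    """Admittance of a transmission line of length l (in km),
    neglecting the conductance of the dielectric."""
    z = R + 1j * Omega * L          # series impedance per km
    y = 1j * Omega * C              # shunt admittance per km
    Z0 = np.sqrt(z / y)             # characteristic impedance
    gamma = np.sqrt(z * y)          # propagation constant
    Z = Z0 * np.sinh(gamma * l)     # total impedance along the line
    Y = 1.0 / Z                     # line admittance
    # Long lines carry more cables; admittances are additive
    if l > 80:
        Y *= 3
    elif l > 10:
        Y *= 2
    return Y
```

For short lines, sinh(γl) ≈ γl, so the admittance magnitude falls off roughly as 1/l until the cable-count factor kicks in above 10 km.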
Since the neural network outputs values in [0, 1], we scale the edge weights to this range. However, a small number of short lines artificially skew the distribution (see figure S5); these are mainly small segments designed to more accurately capture the geometry of a longer line, though sometimes they represent short lines to transformers, power stations, etc. To reduce their impact on the weight distribution, instead of dividing the weights by the maximum value, we divide by the mean and truncate all weights to [0, 1]:

a_ij → min(a_ij / ā, 1),

where ā is the mean edge weight. The scaling factor is absorbed into the coupling coefficient κ. Those edges with weight exactly equal to 1 are reassigned a value chosen uniformly at random in [0.9, 1.0]. As is visible in fig. S5d, the resulting distribution of the weights a_ij is more uniform than that of the distances l_ij and admittances Y_ij, since longer lines' lower admittance is often compensated by a higher voltage. These calculations do not account for, among other things, the fact that we are connecting the nodes with straight lines rather than the real line trajectory; they also do not discern between overground and underground lines, assume a single core per cable, and do not consider differences in transmission line heights, geometries, and materials. These inaccuracies are absorbed into the coefficients α and β, which we tune manually in order to allow for stable phase-locking. In figure S6 we show the equivalent of fig. 5a for the case of second-order Kuramoto dynamics (eq. [3] with α = 1). We again see the neural scheme outperforming OLS regression at very low noise levels, though the performance improvement is less stark than in the first-order case. MALA marginally outperforms the neural scheme at very high noise levels. Enough data is used to ensure full invertibility of the Gram matrix (c = N − 1).
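The normalisation step can be sketched as follows; this is an illustrative implementation of the mean-division, truncation, and random reassignment described above, on toy values.

```python
import numpy as np

def normalise_weights(Y, rng=None):
    """Scale admittances to [0, 1]: divide by the mean and truncate,
    then reassign weights exactly equal to 1 uniformly in [0.9, 1.0]."""
    if rng is None:
        rng = np.random.default_rng()
    a = np.minimum(np.abs(Y) / np.abs(Y).mean(), 1.0)
    mask = a == 1.0
    a[mask] = rng.uniform(0.9, 1.0, size=mask.sum())
    return a

# Toy admittance magnitudes; the mean is 3.375, so only the last
# entry is truncated and then reassigned
Y = np.array([0.5, 1.0, 2.0, 10.0])
a = normalise_weights(Y, np.random.default_rng(0))
```

Dividing by the mean rather than the maximum prevents the few short, high-admittance segments from compressing all other weights towards zero.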

Details on the code
The code is uploaded to the GitHub repository given in the main text. The two models relevant to this work are Kuramoto and HarrisWilsonNW.

Installation
Detailed installation instructions are given in the repository. First, clone the repository, then install the utopya package and all required additional components into a virtual environment, for example via PyPI. In particular, install pytorch. Enter the virtual environment and, from within the project folder, register the project:

utopya projects register .
You should get a positive response from the utopya CLI, and your project should appear in the project list when calling:

utopya projects ls

Note that any changes to the project info file need to be communicated to utopya by calling the registration command anew. You will then have to additionally pass the --exists-action overwrite flag, because a project of that name already exists. See utopya projects register --help for more information. Finally, register the project and its models via:

utopya projects register . --with-models

Running the code
To run a model, execute the following command:

utopya run <model_name> path/to/run_cfg.yml

Passing a custom run configuration in this way is recommended rather than changing the default settings, because the defaults are parameters that are known to work and that you may wish to fall back on in the future. Plots are generated using the plots specified in the <model_name>_plots.yml file. These too can be updated by creating a custom plot configuration and running the model like this:

utopya run <model_name> path/to/run_cfg.yml --plots-cfg path/to/plot_cfg.yml

See the Utopia tutorial for more detailed instructions.
All the images in this article can be generated using so-called configuration sets, which are complete bundles of both run configurations and evaluation configurations. For example, to generate the predictions on the random network with N = 1000 nodes (fig. 5d-e) for the Kuramoto model, you can call:

utopya run Kuramoto --cfg-set N_1000_example

This will run and evaluate the Kuramoto model with all the settings from the Kuramoto/cfgs/N_1000_example/run.yml and eval.yml configurations.

Parameter sweeps
Parameter sweeps are automatically parallelised by utopya, meaning simulation runs are always run concurrently whenever possible. The data is automatically stored and loaded into a data tree. To run a sweep, simply add a !sweep tag to the parameters you wish to sweep over, and specify the values, along with a default value to be used if no sweep is performed. Then, in the run configuration, add the following entry:

perform_sweep: true

Alternatively, call the model with the flag --run-mode sweep. The configuration sets used in this work automatically run sweeps whenever needed, so no adjustment is needed to recreate the plots used in this work.
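As an illustration, a sweep entry might look as follows; the parameter names (here a hypothetical noise level sigma under a Data entry) are illustrative, so consult the configuration files in the repository for the actual keys.

```yaml
# Hypothetical excerpt from a run configuration
parameter_space:
  Kuramoto:
    Data:
      sigma: !sweep
        default: 0.0            # used if no sweep is performed
        values: [0.0, 0.01, 0.1]

perform_sweep: true
```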

Initialising the neural net
The neural net is controlled from the NeuralNet entry of the configuration. num_layers specifies the depth of the net; nodes_per_layer controls the architecture: provide a default size, and optionally any deviations from the default under layer_specific. The keys of the layer_specific entry should be the indices of the layers in question. The optional biases entry determines whether or not biases are to be included in the architecture and, if so, how to initialise them. A default and layer-specific values can again be passed. Setting an entry to default initialises the values using the pytorch default initialiser, a Xavier uniform initialisation. Passing a custom interval instead initialises the biases uniformly at random on that interval, and passing a tilde ~ (None in YAML) turns the bias off. activation_funcs is a dictionary specifying the activation functions on each layer, following the same logic as above. Any pytorch activation function is permissible; if a function requires additional arguments, these can also be passed. Lastly, the optimizer keyword takes any argument allowed in pytorch. The default optimizer is the Adam optimizer [39] with a learning rate of 0.001. The neural net can be initialised from different initial values in the parameter space by changing the random seed in the configuration. Sweeping over different initialisations is achieved by sweeping over the seed, as described in the previous section.
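A NeuralNet entry following the description above might look like this; the layer sizes, intervals, and activation choices are illustrative, not the defaults used in the paper.

```yaml
NeuralNet:
  num_layers: 3                  # depth of the net
  nodes_per_layer:
    default: 20
    layer_specific:
      0: 10                      # first layer deviates from the default
  biases:
    default: default             # pytorch default (Xavier uniform) initialisation
    layer_specific:
      1: [-1, 1]                 # initialise uniformly at random on this interval
      2: ~                       # no bias on this layer
  activation_funcs:
    default: sigmoid
    layer_specific:
      -1:
        name: HardTanh           # activation function requiring arguments
        args:
          - -2                   # min_value
          - +2                   # max_value
  learning_rate: 0.001           # optional; default is 0.001
  optimizer: SGD                 # optional; default is Adam
```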

Training the neural net
The Training entry of the configuration controls the training process. The device key sets the training device. The default is the CPU, but you can also train on the GPU by setting the device to cuda. Note that on Apple Silicon, the device name is mps. Make sure you have installed the correct package for your device, and note that, as of writing, some pytorch functions required for our code (e.g. the trace and hard sigmoid functions) had not yet been implemented for MPS, hence GPU training on Apple Silicon devices was not possible.
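For illustration, a minimal Training entry could look as follows; only the device key is described in the text, and any further keys shown in the repository's configuration files would sit alongside it.

```yaml
Training:
  device: cpu      # or 'cuda' for an NVIDIA GPU, 'mps' on Apple Silicon
```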

Neural Network Architecture
The following is the default configuration for the neural network and training settings used for the Kuramoto model:

Running the code
Configuration sets to reproduce all numerical experiments except the British power grid inference are provided; for instance, to produce the predictions for a random network with N = 100 nodes, simply run:

utopya run Kuramoto --cfg-set N_100_example

Figure 1: (a) Approximate high-voltage electricity transmission grid of Great Britain. Shown are 630 accurately placed nodes, representing power stations, substations, and transmission line intersections, and their connectivity as of January 2023 [46-48]. Colours indicate the operating voltage of the lines. The size of each node indicates its power generation or consumption capacity (absolute values shown). White-ringed nodes indicate the 38 nodes that are real power stations with capacities over 400 MW [45], with all other nodes assigned a random capacity in [−200, +200]. The two dotted edges in the northeast of England are the edges affected by a simulated power cut, labelled by the indices of their start and end vertices. (b) The network response to the simulated power line failure, measured at four different nodes in the network (marked A-D). The equation parameters were tuned to ensure phase-locking of the oscillators (α = 1, β = 0.2, κ = 30). Nodes closer to the location of the line cut (A and B) show a stronger and more immediate response than nodes further away (C and D). The shaded area indicates the 4-second window we use to infer the line location.

Figure 2: The total loss J and its derivatives ∂_s J and ∂_ss J with respect to the iteration count s, averaged over a window of 20 iterations (absolute values shown). The red dotted line indicates the value at which ν is set to 0.

Figure 3: Estimating the line failure location. (a) The densities on the four edges with the highest relative prediction error |ã_ij − a⁰_ij|/a⁰_ij and their respective p-values for measuring the unperturbed value a⁰_ij (ã_ij is the prediction mode). Red dotted lines indicate the values of the unperturbed network, green lines the expectation values of the distributions. The marginals are smoothed using a Gaussian kernel. We use a training set of length L = 400 steps, and the batch size is B = 2. CPU runtime: 24 minutes. (b) True (black) and predicted network responses at three different locations in the network. The responses are each normalised to the value at t = 0. The shaded area represents the 400 time steps used to train the model. While the model is able to perfectly fit the response within the training range, it is not able to learn the full network from insufficient data, causing the time series to diverge for larger t.
α, β, κ, and ϵ are scalar parameters. Our goal is to infer the cost matrix C = (c_ij) from observations of the time series O(t) and W(t). The model includes multiplicative noise.

Figure 4: Inferring economic cost networks. (a) In the model, N origin zones (red) are connected to M destination zones (blue) through a weighted directed network. Economic demand flows from the origin zones to the destination zones, which supply the demand. We model the origin zone sizes O_i(t) as a Wiener process with diffusion coefficient σ_O = 0.1. The resulting cumulative demand at destination zone j is given by W_j. Note that the origin zone sizes fluctuate more rapidly than the destination zones, since there is a delay in the destination zones' response to changing consumer patterns, controlled by the parameter ϵ. We use the parameters as estimated in [31], α = 0.92, β = 0.54, κ = 8.3, and set ϵ = 2. (b) The initial origin and destination zone sizes, given by the total household income of the N = 629 wards in London (blue nodes) and the retail floor space of M = 49 major centres (red nodes) [52,53]. The network is given by travel times, as detailed in the text. Background map: [54]. (c) Predicted degree distribution (solid line) of the inferred network for a high noise level of σ = 0.14, with one standard deviation (shaded area), and the true distribution (red dotted line). CPU runtime: 3 min 41 s.

Figure 5: Computational performance analysis. (a) L1 prediction error (eq. [16]) of the neural scheme, the preconditioned Metropolis-adjusted Langevin sampler, and OLS regression as a function of the noise variance σ on the training data. For very high noise levels, the training data is essentially pure noise, and the prediction errors begin to plateau. First-order Kuramoto dynamics are used (α = 0), though these results also hold for second-order dynamics (cf. fig. S6 in the appendix). Enough data is used to ensure full invertibility of the Gram matrix (c = 1). (b) The L1 accuracy as a function of the convexity c of the loss function (eq. [17]). (d) Predicted degree distribution and (e) triangle distribution of an inferred network with N = 1000 nodes, trained on first-order noisy Kuramoto data (σ = 0.001). The blue shaded areas indicate one standard deviation, and the red dotted lines are the true distributions. CPU runtime: 1 hour 3 minutes.

Figure 6: Quantifying the two types of uncertainty. (a) Hellinger error (eq. [18]) on the degree distribution P(k) as a function of c (eq. [17]) in the noiseless case. The error is normalised to the value at c = 0.21(N − 1). As c increases, the error on the prediction decreases almost linearly. We run the model from 10 different

Figure S1: Example of a deep neural network with 3 hidden layers. The inputs (light blue nodes) are passed through the layers, with links between layers representing the weight matrices W. Each layer also applies a bias (yellow nodes), and the network finally produces an output (orange).

Figure S5: Line data statistics for the British power grid. (a) Histogram of the node distances in the network, with mean, standard deviation, and median given. (b) Histogram of the line voltages. (c) Histogram of the line admittances. As is visible, a small number of short lines artificially skew the distribution. (d) The resulting normalised edge weights. Short edges with an artificially high admittance are assigned a random value in [0.9, 1].

Figure S6: L1 prediction error of the neural scheme, the preconditioned Metropolis-adjusted Langevin sampler, and OLS regression as a function of the noise variance σ on the training data, for second-order Kuramoto dynamics (α = 1).
By default, this runs the model with the settings in the <model_name>_cfg.yml file. All data and plots are written to an output directory, typically located in ~/utopya_output. To run the model with different settings, create a run_cfg.yml file and pass it to the model like this:

utopya run <model_name> path/to/run_cfg.yml additional args and kwargs here ...

You must specify the noise level to use for the ABM during training; the default value is 0. Under the loss_function key you can specify the loss function to use, and pass any arguments or keyword arguments it may require using an args or kwargs key. You can use any available pytorch loss function.