Dynamic latent space relational event model

Dynamic relational processes, such as e-mail exchanges, bank loans and scientific citations, are important examples of dynamic networks, in which the relational events constitute time-stamped edges. In some contexts the network can be considered a reflection of underlying dynamics in a latent space, whereby nodes are associated with dynamic locations and their relative distances drive their interaction tendencies. As time passes, nodes can change their locations, assuming new configurations with different interaction patterns. The aim of this paper is to define a dynamic latent space relational event model. We then develop a computationally efficient method for inferring the locations of the nodes. We make use of the Expectation Maximization algorithm, which embeds an extension of the classical Kalman filter. Kalman filters are known to be effective tools for tracking objects in space, with successful applications in fields such as geolocalization. We extend their application to dynamic networks by filtering the signal from a sequence of adjacency matrices and recovering the hidden movements. Besides the latent space, our formulation also includes more traditional fixed and random effects, yielding a general model that can suit a large variety of applications.


Introduction
Networks appear in many contexts. Examples include gene regulatory networks (Signorelli et al., 2016), financial networks (Cook and Soramaki, 2014), psychopathological symptom networks (De Vos et al., 2017), political collaboration networks (Signorelli and Wit, 2018), and contagion networks (Užupytė and Wit, 2020). Studying networks is important for understanding complex relationships and interactions between the components of the system. The analysis can be difficult due to the many endogenous and exogenous factors that may play a role in the constitution of a network. The aim of statistical modelling in this context is to describe the underlying generative process in order to assist in identifying drivers of these complex interactions. These models can assist in learning certain features of the process, filtering noise from the data, thereby making interpretation possible.
In this manuscript we consider temporal random networks, in which nodes make instantaneous time-stamped directed or undirected connections. Examples are email exchanges, bank loans, phone calls and article citations. A common approach to these networks has been to flatten the time variable and study the resulting static network. Although this method simplifies the calculations, there is clearly a loss of information about the temporal structure of the process. Most networks are inherently dynamic: subjects repeatedly create ties through time. Since the adjustment of ties is influenced by the existence and non-existence of other ties, the network is both the dependent and the explanatory variable in this process (Brandes et al., 2009). Thus, rather than viewing this as a static network, we consider the generative process as a network structure in which the actors interact with each other through time. Edges are defined as instantaneous events. This quantitative framework is known as relational event modelling.
The basic form of a relational event model as an event history model can be found in Butts (2008) with an application to the communications during the World Trade Center disaster. The model has been extended by Brandes et al. (2009) to weighted networks: nodes involved in these events are actors, such as countries, international organizations or ethnic groups. An event is assigned a positive or negative weight depending on a cooperative or hostile type of interaction, respectively. Other examples of relational event modelling include the work by Vu et al. (2017) on interhospital patient transfers within a regional community of health care organizations or the analysis of social interaction between animals (Tranmer et al., 2015).
In a relational event model the connectivity may depend on the past evolution of the network. Keeping track of the past is challenging for dynamic networks because of the high number of possible configurations (k-stars, k-triangles, etc.) that could be taken into account, as well as their closure time and the time during which they keep affecting future configurations. We thus propose to use a summary of the past configurations. A solution that can both summarize the process and effectively approximate the past information is the idea of a dynamic latent space. To describe the latent structure of a network one can think of placing the vertices in a space where the distance between two points describes the tendency or lack of tendency to connect. Among social scientists this is typically called a social space, where actors with more interactions are closer together and vice versa (Bourdieu, 1989). The locations are allowed to change in time. At each time point new connections are formed and the subjects develop attractions and repulsions that force them to change their social space configuration. The new configuration is the one that best reflects the new connectivity behavior. As a result, a location at a certain time reflects past information, within the limits of the latent space formulation. This evolution describes the social history of the subjects, their preferences, and the groups they might join or leave.
The problem of tracking latent locations has been studied by many authors, particularly in the static case, i.e., tracking locations under the assumption that they are fixed over time. For static binary networks Hoff et al. (2002) provide a framework for inference. Extensions of that model have been developed to overcome the limitations of the latent space formulation (Hoff, 2005, 2008, 2009). Similar to the latent space is the stochastic block model, which describes the similarity between the actors by grouping them together. An extension of stochastic block modelling to relational event data is provided by DuBois et al. (2013). An approach for modelling a latent space on dynamic binary networks was proposed by Sarkar and Moore (2005). The method is based on an initial preprocessing phase where raw location estimates are provided through Multidimensional Scaling. In the estimation phase they treat the dynamic locations as fixed parameters and optimize them via a conjugate gradient approach. The distances between nodes are approximated by cutting off the larger ones and including an additional penalty forcing distant nodes to be closer. In our work, we aim to avoid making such ad hoc assumptions. Sewell and Chen (2015) developed a dynamic latent space with node specific parameters that regulate the incoming and outgoing links. The inference is performed via a Metropolis-Hastings algorithm. Instead, we use a Kalman filter, which is computationally more efficient.
Durante and Dunson (2016) developed a Bayesian model using a Polya-Gamma data augmentation for binary connections and Gaussian processes for parameter dynamics, with a non-Euclidean dissimilarity measure. Instead, we tackle the problem from a frequentist perspective providing a method which does not require data augmentation. Moreover, rather than embedding the dynamic latent space into a GLM, we embed it in a relational event model. Although non-Euclidean alternatives are possible, in our application we focus on an easily interpretable Euclidean latent space. Furthermore, our method can be applied to networks with non-binary links that are distributed according to any exponential family distribution.
In section 2 we present several formulations of the latent space relational event model. In section 3 we propose an efficient inference method based on combining the state-space formulation of the model with the EM algorithm. In section 4 we check the performance and limitations of our method via simulations. In section 5 we aim to discover the latent structure of technological innovation by studying over 23 million patent citations from 1967 until 2006.

Latent space relational event models
In this section we introduce a general version of a latent space relational event model. We consider a set of actors, defined as a finite vertex set V = {1, . . . , p}, that can exchange links or edges in time. In principle, we will consider the exchange of relational events, such as discrete interactions, e.g., sending an email or citing a patent, but we will also consider extensions to quantitative exchanges, such as import and export. As drivers of the exchange process we consider both endogenous variables, such as reciprocity, and exogenous variables, such as vertex characteristics. One particular exogenous variable is the relative location of the vertices in some Euclidean latent space, which itself is defined as a dynamic process.
We consider a non-homogeneous multivariate Poisson counting process N = {N_ij(t) | t ∈ [0, T], i, j ∈ {1, . . . , p}} relative to some standard filtration F. In particular, we consider F-measurable rate functions λ_ij(t) that drive the components of the counting process. We assume that the rates λ_ij(t) are functions of the underlying positions X_i(t) and X_j(t), besides possible other exogenous characteristics B_ij(t) and endogenous features N(t), for some measurable function g. Two common choices for the way the rate depends on the locations are the squared distance ||X_i(t) − X_j(t)||^2 between i and j and the inner product X_i(t)'X_j(t) (Hoff et al., 2002). The former induces a symmetric interpretation, whereas the latter allows for a more complex asymmetric interpretation of the state-space. The interaction dynamics itself can be highly structured and parametrized, i.e., g = g_θ, whereas the state-space dynamics is assumed to be a random walk at equally spaced times t^x_k,

X(t^x_k) = X(t^x_{k−1}) + v_k,

with v_k ∼ N(0, Σ) and t^x_0 = 0. The covariance matrix Σ regulates the evolution of the latent process: a large variance allows longer jumps. Given the joint formulation (X, N) of the state-space and interaction process, we will assume that only the interaction process N is observed. The main aim of this paper is to infer the structure of the state-space X and the rate functions λ, or more specifically, the parameter β associated with the functional form λ = g_β.
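To make the generative mechanism concrete, the random-walk latent process and the resulting interaction counts can be simulated with a few lines of code. This is a minimal sketch under our own illustrative assumptions (a two-dimensional latent space, a squared-distance effect with a single intercept alpha, and a discretized Poisson observation scheme); none of the names or values come from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, n = 10, 2, 50          # nodes, latent dimension, time steps
alpha, sigma2 = 1.0, 0.01    # intercept and random-walk variance (illustrative)

# latent random walk: X_k = X_{k-1} + v_k, v_k ~ N(0, sigma2 * I)
X = np.zeros((n + 1, p, d))
for k in range(1, n + 1):
    X[k] = X[k - 1] + rng.normal(0.0, np.sqrt(sigma2), size=(p, d))

# Poisson counts per interval with log-rate alpha - ||X_i - X_j||^2
counts = np.zeros((n, p, p), dtype=int)
for k in range(1, n + 1):
    D2 = ((X[k][:, None, :] - X[k][None, :, :]) ** 2).sum(-1)
    mu = np.exp(alpha - D2)
    np.fill_diagonal(mu, 0.0)          # no self-loops
    counts[k - 1] = rng.poisson(mu)
```

Nodes that drift close together in the latent space produce large counts, while distant pairs rarely interact, which is exactly the signal the filter described later tries to recover.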
Next, we will consider two particular special cases of the latent space formulation of the interacting point process defined above. First we consider the general case, in which the relational events are observed in continuous time. This is the traditional setting for relational events. We will also define a relational event model where the interactions can only happen at specific times. For example, bibliometric citations or patent citations only happen at prespecified publication dates. Furthermore, this model allows a generalization to non-binary relational events, such as export between countries, that can be dealt with in the same inferential framework.

Continuous time relational event process N
We consider a sequence of n relational events, {(i_1, j_1, t_1), . . . , (i_n, j_n, t_n) | t_i ∈ [0, T], i, j ∈ V}, observed according to the above defined relational counting process N. In a latent space relational event model, the rate is defined as

λ_ij(t) = exp( β's_ij(t) − d(X_i(t), X_j(t)) ),

where the latent space effect d(X_i(t), X_j(t)) captures the "vicinity" of the actors. The drivers of the network dynamics, collected in the statistics s_ij(t), can be of various types: exogenous effects, such as global covariates, node covariates and edge covariates, as well as endogenous effects, where network statistics capture quantities such as popularity, reciprocity, and triadic closure. The parameter vector β determines the relative importance of the various effects. Conditional on the process X, the distribution of the lth interarrival time ∆t_ij,l = t_{k_ij,l} − t_{k_ij,l−1} for interaction i → j is generalized exponential with rate λ_ij(t), where k_ij,l ∈ {1, . . . , n} is the time indicator of the lth occasion where i → j happened.
The full log-likelihood of the complete process {X, N} can be factorized in two components,

ℓ(β, Σ; X, N) = log p_β(N | X) + log p_Σ(X).

Although it is common in the REM literature to simplify inference by using the partial likelihood, we keep the generalized exponential component, as it can be estimated more easily in the M-step of the EM algorithm described in section 3.

Discrete time relational event process N
If the relational events are "published" only on prespecified discrete event times T = {t^e_1, . . . , t^e_n}, we make the additional assumption that the rate λ is constant with respect to the endogenous and exogenous variables inside the collection intervals (t^e_k, t^e_{k+1}]. In fact, with respect to the endogenous variable N it makes sense that no further information between the publication dates affects the rates. In other words, assuming a log link for the hazard, for t ∈ (t^e_k, t^e_{k+1}],

λ_ij(t) = λ_ij(t^e_k).

As the interactions i → j are collected at t^e_{k+1} from the observation intervals (t^e_k, t^e_{k+1}], the resulting interval counts y_ij(k) of the number of interactions between i and j are Poisson distributed with rate

µ_ij(k) = (t^e_{k+1} − t^e_k) λ_ij(t^e_k).

As long as the collection time process {t^e_k} is finer than or equal to the change process {t^x_k} of the latent process, we obtain a discrete-time relational event process. An advantage of using discrete time is the reduction of the model complexity. It is not uncommon to observe thousands, even millions of links. Such numbers are not surprising when we consider p(p − 1) processes with an expected total number of links E[Σ_ij N_ij(t)] that grows rapidly. For simplicity of notation we will assume that the relational event collection process and the jumps of the latent space are equal and unitary, t^e_k = t^x_k = k. The model can then be written as a discrete-time state space process,

y_ij(k) ∼ Poisson(µ_ij(x_k)),
x_k = x_{k−1} + v_k,                                        (5)

where v_k ∼ N(0, Σ). Given the observations Y = y and X = x, the complete log-likelihood for the state space model in (5) can again be factorized in two components, where, up to an additive constant,

log p_β(Y|X) = Σ_{k,i,j} [ y_ij(k) log µ_ij(k) − µ_ij(k) ]

and log p_Σ(X) is as above. The factorization is according to the directed graph in Figure 1, where y_k ⊥ y_{−k}, x_{−k} | x_k and x_{k+1} ⊥ x_{k−1} | x_k.
Similar to Butts (2008) and Perry and Wolfe (2013), who focused on non-homogeneous exponential waiting times, this approach focuses on non-homogeneous Poisson counts. One advantage of the latent space formulation is the dimensionality reduction in the latent representation. As the number of nodes p increases the number of observed counts p(p − 1)n grows quadratically while the latent space grows linearly as pdn.
Dynamic exponential family network model. Given the state space formulation in (5), it is possible to generalize the model by considering connections drawn from any exponential family distribution without changing the inference procedure. In fact, ignoring the connection with any underlying counting process, we can define a temporal network process on discrete time intervals k (k ∈ {1, . . . , n}) between nodes i and j as f(y_ij(k)) = exp((y_ij(k)θ − b(θ))/a(ϕ) + c(y_ij(k), ϕ)), where θ is the edge-specific canonical parameter. Using the canonical link function, we can specify the canonical parameter in a similar fashion to (4), where the values for x are the latent states as before. It is also possible to add additional covariates, but we do not consider this case here. The inferential method presented in this manuscript remains mostly the same, with a minimal change: the mean µ(x_k) and variance R_k of the process are effectively replaced by the exponential family mean b'(θ) and variance a(ϕ)b''(θ). This generalized temporal network model can be used to model import and export or other dynamic networks with weighted edges.

Inference
In this section we develop all the necessary steps for making inference on the latent states x_k and the parameters Σ and β. Since the latent process x_k is unobserved, we aim to maximize the marginal likelihood ∫ L(β, Σ; y, x) dx. We use the Expectation Maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm is widely used in problems where certain variables are missing or latent. It consists of an iterative maximization of the conditional expectation of the latent process X | N, β, Σ with respect to the data. Due to the stepwise dynamics of the latent locations (1), the expectation step is equivalent for both models presented in Section 2.1 and Section 2.2. As the locations are constant within the intervals T, the continuous time non-homogeneous exponential relational event model N reduces to a discrete time Poisson model Y during the E-step.
The expectation step computes

Q(β, Σ | β*, Σ*) = E[ ℓ(β, Σ; y, X) | y, β*, Σ* ],

where β*, Σ* denote the parameters estimated at the previous EM iteration. In the maximization step Q(β, Σ | β*, Σ*) is maximized with respect to the parameters β, Σ. The two steps are iterated until convergence is reached. The expectation step is typically challenging due to the high dimensional nature of the integral involved.
The expectation of the log-likelihood can approximately be written as a function of the first two conditional moments E[x_k|y_1:n] and V[x_k|y_1:n]. Exploiting the state space formulation of the model (5), we can estimate these two quantities with a Kalman filter and smoother (Kalman, 1960). The filter derives the mean and variance of the latent process x_k conditional on the information on y up to time k, x̂_{k|k} = E[x_k | y_1:k] and V_{k|k} = V[x_k | y_1:k]. The smoother refines these quantities accounting for the complete information on y up to time n, x̂_{k|n} = E[x_k | y_1:n] and V_{k|n} = V[x_k | y_1:n]. The expected log-likelihood can then be calculated using the quantities obtained from the smoother.

E-Step: Extended Kalman Filter
The Kalman filter is one of the most popular algorithms for making inference on state space models, providing a solution that is both computationally cheap and accurate. The Kalman filter is an iterative method that calculates the conditional distribution of the latent x_k. Given the causal DAG in Figure 1, x_k depends on x_{k−1} and the observed y_k. Assuming prior knowledge on the distribution of x_{k−1}, the conditional distribution of x_k is calculated easily. The procedure is applied sequentially from time 1 to n, where the conditional distribution obtained at time k becomes the prior knowledge for the next time point. An arbitrary distribution is specified for the initial x_0. Since calculating the conditional distribution entirely can be difficult, only the first moments are calculated. The calculation of the conditional distribution involves two steps that are universal in the filtering literature: predict and update. In order to be consistent with the aforementioned literature we denote x̂_{k|k} = E[x_k|y_1:k] and V_{k|k} = V[x_k|y_1:k] as the expectation and variance conditional on having observed y_1:k.

Predict
Assume that at time k − 1 the approximated conditional distribution of the latent locations is x_{k−1} | y_1:k−1 ∼ N(x̂_{k−1|k−1}, V_{k−1|k−1}). For the initial case k = 1 we set arbitrarily x̂_{0|0} = v_0 and V_{0|0} = Σ_0. The predict step calculates the first moments of x_k conditional on y_1:k−1. In fields such as physics, chemistry or engineering it is common to employ a forward function x_k = f(x_{k−1}) + v_k related to the physical properties of the system. In our case the random walk formulation makes no constraints on the latent process evolution. The forward function is the identity, with moments

x̂_{k|k−1} = x̂_{k−1|k−1},
V_{k|k−1} = V_{k−1|k−1} + Σ.

These are called the a priori mean and variance of the latent locations, before observing y_k.

Update
The update step finalizes the calculation of the conditional distribution. We consider V[y_k] = R_k, where counts are independent with variance equal to the mean, R_k = diag(µ(x_k, β)). In case a general dynamic network model with exponential family weighted edges, as described in Section 2.2, is considered, the mean µ(x_k) and variance R_k vary accordingly. Kalman filters assume that the observed process y_k is Gaussian and the transformations involved are linear. The Extended Kalman Filter (Anderson and Moore, 2012) overcomes these limitations by means of a first order Taylor expansion of the mean function around the predicted locations,

µ(x_k) ≈ µ(x̂_{k|k−1}) + M_k (x_k − x̂_{k|k−1}),

where M_k is the Jacobian of µ evaluated at x̂_{k|k−1}. The joint multivariate distribution of the observed and latent process is then L, some probability law parametrized by the first two moments of (y_k, x_k) given y_1:k−1. Using the multivariate regression formulation we obtain the conditional moments of x_k,

x̂_{k|k} = x̂_{k|k−1} + K_k ( y_k − µ(x̂_{k|k−1}) ),
V_{k|k} = V_{k|k−1} − K_k M_k V_{k|k−1},

with filtering matrix K_k = V_{k|k−1} M_k' (M_k V_{k|k−1} M_k' + R_k)^{−1}; see Appendix B for more details. We hence obtain the posterior distribution x_k | y_1:k ∼ N(x̂_{k|k}, V_{k|k}), which is approximated to be Gaussian. This will be the starting distribution for the inference at time k + 1. The filtering procedure is shown in Algorithm 1. In Figure 2 we show a visual representation of the algorithm: at each time point the model takes as input an adjacency matrix and returns the locations in the latent space.
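A single predict-update sweep of such a filter can be sketched as follows. This is an illustrative implementation for Poisson counts, not the paper's Algorithm 1: the numerical Jacobian and all interface choices are our own assumptions.

```python
import numpy as np

def ekf_step(x_prev, V_prev, y, mu, Sigma, eps=1e-6):
    """One extended Kalman filter step for Poisson observations.

    x_prev, V_prev : posterior mean/covariance at time k-1
    y              : observed count vector at time k
    mu             : function mapping the latent vector to the Poisson means
    Sigma          : random-walk covariance of the latent process
    """
    # predict: the forward function is the identity
    x_pred = x_prev.copy()
    V_pred = V_prev + Sigma

    # numerical Jacobian M_k of mu at the predicted state
    m0 = mu(x_pred)
    q = x_pred.size
    M = np.empty((m0.size, q))
    for j in range(q):
        dx = np.zeros(q)
        dx[j] = eps
        M[:, j] = (mu(x_pred + dx) - m0) / eps

    # update: Poisson variance equals the mean, R_k = diag(mu)
    R = np.diag(m0)
    S = M @ V_pred @ M.T + R
    K = V_pred @ M.T @ np.linalg.inv(S)
    x_filt = x_pred + K @ (y - m0)
    V_filt = V_pred - K @ M @ V_pred
    return x_filt, V_filt
```

Here the Jacobian M_k is obtained by finite differences for generality; for a squared-distance rate it can of course be written in closed form.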
In the update step the latent locations are updated according to the magnitude of the prediction error: a larger error in the prediction corresponds to a wider change in the locations. The filtering matrix K_k, capturing the linear relationship between the latent and observed processes, weights this prediction error. K_k acts as a ratio between the latent variance and the total variance, composed of the latent variance and the noise R_k; K_k thus filters the prediction error according to the signal-to-noise ratio. Fahrmeir (1992) simply considers it as a single Fisher scoring step; see Appendix E.
The Kalman filter can be interpreted as both a frequentist and a Bayesian method. Under the Bayesian perspective the filtering procedure consists of a sequence of updates of the posterior mean and variance (Gamerman, 1991, 1992; West et al., 1985). From the frequentist side, estimation based on the posterior mode is equivalent to the maximization of a penalized likelihood (Fahrmeir and Kaufmann, 1991; Fahrmeir, 1992); see Appendix E. Approximating the posterior distribution with the same family as the prior, i.e., Gaussian, the posterior mean coincides with the posterior mode, hence the equivalence of the two approaches. This double interpretation makes Kalman filters appealing for both types of applications.

Smoother
The smoother moves backward from the last prediction to the first. It calculates the first moments of the latent process conditional on the information of all time points. Similarly to the EKF, the backward matrix B_k can be calculated by considering the multivariate distribution of the latent locations at two consecutive time points. Since the forward function is the identity, the multivariate regression formula gives

B_k = V_{k−1|k−1} V_{k|k−1}^{−1}.

According to the conditional independence in Figure 1 we have (x_{k−1} ⊥ y_k:n) | x_k, since x_k closes the dependency path. Using the iterated expectation rule we have

x̂_{k−1|n} = x̂_{k−1|k−1} + B_k (x̂_{k|n} − x̂_{k|k−1}),

where x̂_{k−1|k−1} and x̂_{k|k−1} are constants. In the same way, using the iterated variance rule,

V_{k−1|n} = V_{k−1|k−1} + B_k (V_{k|n} − V_{k|k−1}) B_k';

see Appendix C for more details. The smoothing procedure is presented in Algorithm 2 and is known as the Rauch-Tung-Striebel smoother. The final iteration of the smoother updates the starting values x̂_{0|0} and V_{0|0}. These values are used as starting points for the successive EM iteration.
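The backward recursion just described can be sketched as follows, assuming the identity forward function of the random walk and that the forward filter has stored the filtered and predicted moments (the array layout is our own choice):

```python
import numpy as np

def rts_smoother(x_filt, V_filt, x_pred, V_pred):
    """Rauch-Tung-Striebel smoother for a random-walk latent process.

    x_filt[k], V_filt[k] : filtered moments E[x_k|y_1:k], V[x_k|y_1:k], k = 0..n
    x_pred[k], V_pred[k] : predicted moments E[x_k|y_1:k-1], V[x_k|y_1:k-1], k = 1..n
    Returns the smoothed moments E[x_k|y_1:n], V[x_k|y_1:n].
    """
    n = len(x_filt) - 1
    x_smooth = [None] * (n + 1)
    V_smooth = [None] * (n + 1)
    x_smooth[n], V_smooth[n] = x_filt[n], V_filt[n]
    for k in range(n, 0, -1):
        # backward matrix B_k = V_{k-1|k-1} V_{k|k-1}^{-1} (identity forward map)
        B = V_filt[k - 1] @ np.linalg.inv(V_pred[k])
        x_smooth[k - 1] = x_filt[k - 1] + B @ (x_smooth[k] - x_pred[k])
        V_smooth[k - 1] = V_filt[k - 1] + B @ (V_smooth[k] - V_pred[k]) @ B.T
    return x_smooth, V_smooth
```

The final pass through k = 1 refreshes the starting values, which is exactly how the smoother feeds the next EM iteration.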

M-Step: a Generalized Additive Model
In the maximization step we maximize the expected log-likelihood with respect to the parameters β, Σ, and here we make the first distinction between the continuous (3) and discrete (6) time models. For the continuous time process N the expected log-likelihood decomposes as Q(β, Σ) = Q_E(β) + Q_G(Σ), with Q_E the exponential component of the observed process and Q_G the Gaussian component of the latent process. For the discrete time process Y the expected log-likelihood decomposes as Q(β, Σ) = Q_P(β) + Q_G(Σ), with Q_P the Poisson component. Notice that the Poisson component Q_P(β) and the exponential component Q_E(β) do not depend on Σ, just as the Gaussian component Q_G(Σ) does not depend on the remaining parameters β. These quantities can therefore be optimized separately.

Gaussian component
We can maximize the Gaussian component by finding the zero of its first derivative with respect to Σ. Rearranging the elements and taking the expectation as shown in Appendix D we obtain

Σ̂ = (1/n) Σ_{k=1}^n E[ (x_k − x_{k−1})(x_k − x_{k−1})' | y_1:n ].

This result corresponds to the one presented in Fahrmeir (1994). Substituting V_{k|n} B_k' = Cov(x_k, x_{k−1} | y_1:n) we have the equivalence with the result of Watson and Engle (1983). It is crucial to have a good estimate of Σ. When Σ is small, only a small portion of the prediction error is used to update the locations, and therefore the latent process moves slowly and with delay. When Σ is large, the estimated latent process is heavily influenced by the last observation and has a tendency to overfit the observed process. In some applied fields Σ is tuned manually by searching for overfitting or delayed behavior in the errors. Our EM provides a principled solution and avoids manual tuning.
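Given the smoothed moments, the resulting M-step update for Σ can be sketched as below; a minimal illustration in which the lag-one covariances Cov(x_k, x_{k−1} | y_1:n) are assumed to be available from the smoother (the interface is ours, not the paper's):

```python
import numpy as np

def m_step_sigma(x_smooth, V_smooth, C_lag):
    """M-step estimate of the random-walk covariance Sigma.

    x_smooth[k] : smoothed mean E[x_k | y_1:n], k = 0..n
    V_smooth[k] : smoothed covariance V[x_k | y_1:n]
    C_lag[k]    : lag-one covariance Cov(x_k, x_{k-1} | y_1:n), k = 1..n
    """
    n = len(x_smooth) - 1
    q = x_smooth[0].size
    S = np.zeros((q, q))
    for k in range(1, n + 1):
        diff = (x_smooth[k] - x_smooth[k - 1]).reshape(-1, 1)
        # E[(x_k - x_{k-1})(x_k - x_{k-1})^T | y_1:n], expanded in moments
        S += (V_smooth[k] + V_smooth[k - 1]
              - C_lag[k] - C_lag[k].T
              + diff @ diff.T)
    return S / n
```

Expanding the expectation into smoothed variances, lag-one covariances and mean differences, as the loop does, is the standard way of evaluating this estimator.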

Poisson component
For arbitrary exponential family distributed edges, as described in Section 2.2, the observed process component can be maximized numerically with a general optimization algorithm. However, for the Poisson distribution a more elegant solution is available. The expectation of the Poisson component for the discrete time process Y can be rearranged so that, up to an additive constant, it takes the form of a Poisson log-likelihood with associated rates

µ̃_ij(k) = exp( β's_ij(k) + log E[e^{−d(x_i(k),x_j(k))} | y_1:n] ).

The optimization can then be performed by fitting a Generalized Additive Model (Wood, 2013) with this linear predictor and the offset log(E[e^{−d(x_i(k),x_j(k))} | y_1:n]); see Appendix D for the full derivation. The expected value in the offset cannot be further simplified, so we use a second order Taylor approximation, which can be expressed as a function of the first two moments of the latent locations, E[x_k|y_1:n] and V[x_k|y_1:n]. Consider g_ij(x) = e^{−d(x_i(k),x_j(k))}; then the expectation of the Taylor expansion at x̂_{k|n} is

E[g_ij(x_k) | y_1:n] ≈ g_ij(x̂_{k|n}) + (1/2) tr( ∇²g_ij(x̂_{k|n}) V_{k|n} ),

where the expectation of the first derivative term is zero. The GAM is an elegant way to specify the remaining fixed and random effects. This formulation is very general and allows one to estimate constant and linear effects, or to use splines for estimating non-linear and time-varying effects.
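The quality of such a second order approximation of the offset term can be checked numerically; the following sketch compares it with a Monte Carlo estimate for a single pair of nodes in a two-dimensional latent space with squared-distance d, using a finite-difference Hessian (all values are illustrative assumptions of ours):

```python
import numpy as np

rng = np.random.default_rng(1)

# g(x) = exp(-||x_i - x_j||^2) for the stacked coordinates x = (x_i, x_j) in R^4
def g(x):
    return np.exp(-np.sum((x[:2] - x[2:]) ** 2))

mean = np.array([0.3, 0.0, -0.2, 0.1])   # smoothed mean of (x_i, x_j), illustrative
V = 0.02 * np.eye(4)                     # smoothed covariance, illustrative

# finite-difference Hessian of g at the mean
eps, q = 1e-4, 4
H = np.empty((q, q))
for a in range(q):
    for b in range(q):
        ea = np.zeros(q)
        eb = np.zeros(q)
        ea[a] = eps
        eb[b] = eps
        H[a, b] = (g(mean + ea + eb) - g(mean + ea - eb)
                   - g(mean - ea + eb) + g(mean - ea - eb)) / (4 * eps ** 2)

# second order Taylor approximation of E[g(x)]: the first order term vanishes
taylor = g(mean) + 0.5 * np.trace(H @ V)

# Monte Carlo reference for comparison
draws = rng.multivariate_normal(mean, V, size=200_000)
mc = np.mean(np.exp(-np.sum((draws[:, :2] - draws[:, 2:]) ** 2, axis=1)))
```

For moderate smoothed variances the two estimates agree closely, which is what makes the cheap Taylor form usable as a GAM offset.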

Exponential component
The expectation of the exponential component for the continuous time process N is

Q_E(β) = E[ Σ_{i,j,l} ( log λ_ij(t_{k_ij,l}) − λ_ij(t_{k_ij,l}) ∆t_ij,l ) | y_1:n ].

Note that, up to a multiplicative constant y_ij(k), the exponential log-likelihood factorizes similarly to that of the Poisson. Even in this case the expected log-likelihood can be rewritten as an exponential log-likelihood with the same offset as in (9). The inference is performed via a survival regression with these rates and exponential waiting times.

Higher order approximation
The EKF is based on the first order Taylor expansion in (18). We can, however, approximate the µ function to a higher order. A popular solution is the Unscented Transformation, the key ingredient of the Unscented Kalman Filter (UKF) (Julier and Uhlmann, 1996, 1997). The algorithm has a similar shape to the EKF, with the difference that the filtering matrix K_k is calculated empirically. We begin with a fixed number of points that approximate a Gaussian, i.e., a discrete distribution having the same first and second (and possibly higher) moments. Each point in the discrete approximation can be transformed directly. The mean and the covariance of the transformed ensemble can then be computed as the estimate of the nonlinear transformation of the original distribution.
Given a pd-dimensional Gaussian with covariance V_{k|k−1}, we can construct a set of points having the same sample covariance from the columns (or rows) of the matrix square root of (κ + pd)V_{k|k−1}. The square root of the matrix is typically computed via a Cholesky decomposition. Adding and subtracting these points to x̂_{k|k−1} yields a symmetric set of 2pd + 1 points (central point included) having the desired sample mean and covariance. This is the minimal number of points capable of encoding this information (Julier and Uhlmann, 1996). We then calculate the sample mean and covariance of the transformed points. Finally, the filtering matrix K_k can be calculated as the ratio between the sample cross-covariance and the sample variance.
The Unscented Kalman Filter is presented in Algorithm 4. The prediction and update steps are the same as those of the EKF. The κ parameter regulates both the weight of the central point and the spreading of the other points: a large κ leads to a wider spreading of the points. Julier and Uhlmann (1997) suggest the useful heuristic pd + κ = 3. The use of the Unscented Kalman filter makes the computation of (10) straightforward, by simply taking the sample mean of the transformed ensemble.
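The construction of the sigma points and the propagation of a Gaussian through a nonlinear mean function can be sketched as follows; this is a generic illustration of the unscented transform with the common weighting scheme, not the paper's Algorithm 4:

```python
import numpy as np

def unscented_points(mean, cov, kappa):
    """Generate 2q+1 sigma points with the given sample mean and covariance."""
    q = mean.size
    L = np.linalg.cholesky((kappa + q) * cov)   # matrix square root
    pts = [mean] + [mean + L[:, j] for j in range(q)] \
                 + [mean - L[:, j] for j in range(q)]
    w0 = kappa / (kappa + q)                    # weight of the central point
    w = np.r_[w0, np.full(2 * q, 1.0 / (2 * (kappa + q)))]
    return np.array(pts), w

def unscented_transform(f, mean, cov, kappa):
    """Propagate a Gaussian through f; return the transformed mean and covariance."""
    pts, w = unscented_points(mean, cov, kappa)
    fx = np.array([f(x) for x in pts])
    m = w @ fx                                  # weighted sample mean
    d = fx - m
    P = (w[:, None] * d).T @ d                  # weighted sample covariance
    return m, P
```

For a linear f the transform reproduces the exact moments, and for a nonlinear f it captures the mean to second order, which is the advantage over the first order EKF linearization.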

Computational aspects
The p² × p² matrix inversion in (8) represents a computational bottleneck in many Kalman filter applications. However, there are cases where the dimension of the latent process is much smaller than that of the observed process. The Sherman-Morrison-Woodbury identity can then be employed, requiring inversions of p × p matrices only. As the latent space employed by our model has a cheap p-dimensional representation, our scenario is particularly well suited to the Sherman-Morrison-Woodbury identity. The identity is closely related to the Information Filter (see Appendix E), whose usage is equivalent. The overall computational cost of the algorithm is therefore dominated by the inversion of a p × p matrix (Mandel, 2006).

Model selection
The conditional distribution of the latent space x given the observed process y can be used for assessing the uncertainty about the latent process. Variability bands can be drawn by using the quantiles of the distribution x_k|n ∼ N(x̂_{k|n}, V_{k|n}), and the user can visually check whether the dynamic locations are far from being a constant line, as shown in Figure 4.
Akaike Information Criterion. The dimension d of the latent space can be selected using an information criterion such as the conditional AIC,

cAIC = −2 log p_β̂(y | x̂) + 2Φ,

where Φ is the effective degrees of freedom of the fixed and random latent part of the model. Saefken et al. (2014) present a unifying approach for calculating the conditional Akaike information in generalized linear models that can be used in this context. This allows us to select the latent space dimension d that minimizes the conditional Akaike criterion. The cAIC is also used for selecting between the two filters, EKF and UKF, or for choosing between different Σ structures, e.g., a diagonal matrix with either the same or different variance parameters. In the same way we use the cAIC to choose between a static and a dynamic model. The static model, where all the locations are fixed in time, can be obtained with a modification of our algorithm: it can be viewed as a dynamic model with one single time interval, obtained by grouping together all the time intervals. The filtering procedure then reduces to the update of the locations at the starting point and at the single interval, with the convergence Σ̂ → 0.
Goodness-of-fit. We can assess the model goodness-of-fit in the same way as in multivariate generalized linear models. Residual plots can be useful for spotting violations of the assumptions, e.g., the latent space assumption or the family and thus the correct variance function. Although it is possible to inspect all p(p − 1) fits on the counts y_k, we recommend a cheaper alternative. Residuals can be inspected by plotting the sequence of locations x_k|n with the links colored according to the studentized residuals: red links for large residuals and green for small ones, with all the shades in between. If the variance function is misspecified we expect to observe more red links between closer nodes. If the latent space assumption is violated we expect to see red links evenly spread over the network.

Simulation study
In order to assess the performance of the method we carry out a simulation study. We specify logistic functions for the latent location trajectories x_k, rescaling and shifting these functions in different ways. The link counts are generated from a Poisson distribution with log(µ_ij(x_k)) = α − ||x_i(k) − x_j(k)||². Figure 5 shows a possible set of locations (the black lines). We simulated the observed process Y 200 times from these trajectories; the colored lines are the 200 trajectories estimated by the EM-EKF. We simulated with p = 10 nodes, n = 100 intervals and d = 2 dimensions.
The study we carried out consists of a set of simulations that investigate the model behavior in different scenarios. We consider the model with p = 10, n = 100, d = 2 and we vary the number of nodes, intervals and dimensions. We also challenge the model with misspecification of the distribution family and with highly clustered or sparse settings. We also report the static model performance as a baseline for comparison. As performance measure we use the out-of-fold Kullback-Leibler divergence between the distribution of a new sample under the true and under the estimated locations, where y_new denotes an additional sample generated from x_true. The Kullback-Leibler divergence is based on the distance matrix, which is invariant to rotations and translations of the locations.
Varying the number of nodes p. Figure 6 shows the results of varying the number of nodes p = 5, 10, 25, 50. EKF and UKF have almost the same performance, which improves as p increases, as a consequence of the increased information available to the model. The dynamic latent space clearly outperforms the static model, whose KL fit remains stable with varying p.
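For reference, a per-edge Kullback-Leibler divergence between Poisson rate matrices can be computed as follows; this is our own hedged sketch of such a measure and not necessarily the exact definition used in the study:

```python
import numpy as np

def poisson_kl(mu_true, mu_hat):
    """Sum over all edges of KL(Poisson(mu_true) || Poisson(mu_hat))."""
    mu_true = np.asarray(mu_true, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    return np.sum(mu_true * np.log(mu_true / mu_hat) - mu_true + mu_hat)
```

The divergence is zero only when the estimated rates match the true ones, and since the rates depend on the locations only through the distance matrix, it is invariant to rotations and translations of the latent configuration.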
Varying the number of intervals n. Figure 7 shows the results of varying the number of observed time intervals n = 10, 50, 100, 1000. For the dynamic models there is a strong performance improvement for low n, reaching a plateau beyond n = 100, where adding further intervals does not contribute appreciably to the KL. The case n = 10 shows that even for a low number of intervals the dynamic model provides a better result than the static model.
Varying the latent dimension d. We noticed a slight decrease in performance when increasing the latent dimension. A possible explanation lies in the number of observations np(p − 1), which increases as we increase p and n. The latent dimension d contributes nothing to the number of observations, and hence we observe no real difference in the performances.
Computational costs. Figure 7 shows that the computational cost grows approximately linearly with n, as the filter replicates the same matrix operations n times. Unlike with n, the computational costs in Figure 6 grow non-linearly with the number of nodes p (Mandel, 2006). As with the performances, varying d does not make a substantial difference to the computational costs.
Effect of overdispersion. In Figure 8 we investigate the model behavior under overdispersion. We simulate the data from a Negative Binomial with mean µ_ij(x_k) and quadratic variance function µ_ij(x_k) + µ_ij(x_k)^2, and compare it to data simulated from a Poisson distribution. We study the performance of our Poisson model under different ranges of the rate µ_ij(x_k). For low rates the Negative Binomial variance is almost the same as that of the Poisson, and here we observe the same performance in the two settings. For high rates the fit on Negative Binomial counts gets worse and is comparable to that of the static model. For the highest rate the signal-to-noise ratio in the data is so low that the model diverges in all the simulations. In these cases the solution is to change the distribution specification and fit the model with the right variance function. The average link rate is related to the sparsity of the observed counts y. Figure 8 shows that the model still works in high-sparsity settings without divergence problems. This allows the user to freely specify a high number of intervals n for the analysis.
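To illustrate the overdispersed setting, Negative Binomial counts with mean µ and quadratic variance µ + µ^2 can be generated by fixing the dispersion parameter at 1; this is a sketch, and the paper's simulation details may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def nb_counts(mu, size):
    """Negative Binomial draws with mean mu and variance mu + mu**2:
    dispersion r = 1 and success probability p = 1 / (1 + mu)."""
    return rng.negative_binomial(1, 1.0 / (1.0 + mu), size=size)

mu = 5.0
y_pois = rng.poisson(mu, size=200_000)
y_nb = nb_counts(mu, size=200_000)
# Same mean, but the Negative Binomial variance is inflated by mu**2
# (about 5 for the Poisson draws versus about 30 for the Negative Binomial)
```

At low rates the two variances nearly coincide (µ^2 is negligible relative to µ), matching the observation that the Poisson fit only degrades for high rates.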
Considerations on identifiability. The latent formulation is identifiable in the relative distances but not in the locations (Hoff et al., 2002): infinitely many combinations of rotations and translations yield the same distances and therefore the same likelihood. This implies the non-identifiability of Σ, as the coordinate system rotates. Each update of the filter and smoother may involve a shift and rotation of the next location configuration. As a result, when we update the starting points x_{0|0} for the next EM iteration they may be shifted and rotated, with a corresponding rotation of Σ. These movements stabilize as the starting points x_{0|0} converge. If identifiability is required in the analysis, the user can specify Σ as spherical, or spherical within each node, so that Σ is unaffected by rotations.
Considerations on filter divergence. A practical aspect that most Kalman filter users face when working with real data is the divergence problem. Many factors can influence the tendency to diverge, such as a wrong variance function in R_k, poor approximation of the non-linearity, an inappropriate initial choice of β, abrupt changes in link rates, or too-large variances V_{0|0} and Σ. In those cases R_k is problematic and might be approximated by R_{k−1}. With bad starting points x_0 the location updates can change abruptly, because in a non-convex likelihood optimization the locations jump in search of a more stable configuration. Fine-tuning parameters and starting points can make a difference when divergence occurs. A problematic R_k can also be handled by taking more update steps on the same time point (Fahrmeir, 1992). Inflating R_k solves overdispersion problems, although inferring the correct variance function of the data might take some extra effort. Sufficiently good starting points x_{0|0} can be calculated via Multidimensional Scaling, or by reversing the time dimension and running the Kalman filter backward.
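The Multidimensional Scaling initialization mentioned above can be sketched with classical MDS. In practice the squared-distance input would be derived from the observed counts (for instance via D2_ij ≈ α − log(y_ij + ε) under the log-rate used here, a hypothetical recipe); the embedding itself is standard:

```python
import numpy as np

def classical_mds(D2, d=2):
    """Classical MDS: recover d-dimensional locations (up to rotation and
    translation) from a matrix of squared Euclidean distances D2."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centring matrix
    B = -0.5 * J @ D2 @ J                   # double-centred Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:d]           # top-d eigenpairs
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

# Sanity check: locations recovered from exact squared distances reproduce them
pts = np.random.default_rng(2).normal(size=(6, 2))
D2 = np.sum((pts[:, None, :] - pts[None, :, :])**2, axis=-1)
rec = classical_mds(D2, d=2)
D2_rec = np.sum((rec[:, None, :] - rec[None, :, :])**2, axis=-1)
```

Since the model is only identifiable in the distances, the rotation/translation ambiguity of the MDS solution is harmless for initialization.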
Furthermore, we recommend starting the EM from the static model, i.e., with a low Σ, and then expanding it slowly toward the maximum likelihood point, as starting with high Σ and V_{0|0} may overfit the data. In the most pathological cases the model diverges before reaching the maximum likelihood point, and a profile maximum likelihood estimate will be the best alternative. Another delicate aspect is the choice of the rate function. The function exp(−||x_i(k) − x_j(k)||_2^2) is appealing because it is differentiable. However, it can be more unstable than some non-differentiable functions that exhibit a weaker non-linearity. Every choice brings different complications, and there is no choice that is optimal in all scenarios.

Dynamics of patent citation patterns
The patent citation process presents some peculiar characteristics: patents are continuously added to the system, and citations happen only at the moment of a patent's creation. A patent can only cite patents that were previously added, not ones added in the future. In this analysis we group patents by their ICL class and use these fields as the units of our analysis. Since there is a continuous exchange of citations between the fields, the resulting process can be regarded as a point process with log-rate
log(µ_ij(x_k)) = α_0 + sender_i + receiver_j − ||x_i(k) − x_j(k)||_2^2,
where i and j are two fields, α_0 is an intercept, and sender_i and receiver_j are, respectively, the sender and receiver random effects. The citation rate is proportional to the number of patents added in a field within a year, and if in a certain year no patents are added in a field the rate must be 0. We therefore specify an additional offset C_i(k) that accounts for the number of patents added in field i at time k. The inclusion of C_i(k) changes the interpretation: we are now modeling the citation rate per single patent in class i. We consider a bidimensional latent space for the sake of visual representation. We fitted the EM with both the EKF and the UKF, obtaining similar results, as anticipated by the simulation study. Figure 11 presents the estimated locations of the fields as well as the sender and receiver effects. The legend letters match the mentioned classification of fields.
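A sketch of the resulting per-interval rate computation (names and shapes hypothetical) makes the role of the offset explicit:

```python
import numpy as np

def citation_rates(x_k, alpha0, sender, receiver, C_k):
    """Citation rates between fields at interval k: per-patent log-rate
    alpha0 + sender_i + receiver_j - ||x_i - x_j||^2, multiplied by the
    offset C_i(k), the number of patents added in field i at time k."""
    diff = x_k[:, None, :] - x_k[None, :, :]
    log_mu = alpha0 + sender[:, None] + receiver[None, :] - np.sum(diff**2, -1)
    rate = np.exp(log_mu) * C_k[:, None]  # fields with C_i(k) = 0 get rate 0
    np.fill_diagonal(rate, 0.0)           # diagonal excluded in the dyadic model
    return rate

rng = np.random.default_rng(3)
x_k = rng.normal(size=(4, 2))
rate = citation_rates(x_k, alpha0=0.5,
                      sender=rng.normal(size=4), receiver=rng.normal(size=4),
                      C_k=np.array([3.0, 0.0, 1.0, 2.0]))
```

Multiplying by C_i(k) rather than adding it to the linear predictor is what yields the per-patent interpretation of the remaining coefficients.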
The sender and receiver effects can be interpreted as the asymmetry in citations between fields that the latent space representation fails to capture. Figure 11(d) shows that the Textile, Papers and Fixed constructions classes have very low receiver effects, meaning that they are cited below average. Figure 11(c) shows that Physics patents have a low tendency to cite others. The high sending and receiving tendencies of the Chemistry, Metallurgy and Electricity patents must be seen in the context of Figures 11(a) and (b): the fact that we observe such large effects together with their distant location from the other patent classes might suggest a violation of the model assumptions. The two locations should be closer to the main cluster, but there is no latent configuration that provides a good fit. For comparison we fit the model without random sender and receiver effects: Figure 12(b) shows that the distances of the Chemistry, Metallurgy and Electricity patent classes were inflated and that the random sender and receiver effects were indeed capturing the misrepresentation. The Physics patents now come very close to Electricity, whereas the Chemistry and Metallurgy class overlaps with Human necessities. Looking back at the discrepancy between sender and receiver effects, we see that Chemistry and Metallurgy patents tend to receive more citations from Human necessities, whereas Physics patents receive more citations from Electricity. In Figure 12(b) the Textile, Papers and Fixed constructions classes are pushed far away, as the latent space now accounts for their negative receiver effects. Figure 10 shows a peculiar behavior, as the locations are static in the initial 10 years. Patents can only cite back in time, and therefore the first patents added to the system cannot cite patents submitted before 1967. The figure suggests that around 1976 the patent citation process starts behaving "correctly", i.e., that the database starts to include most cited patents.
This seems reasonable as patents cite an average of 10 years back in time, with a mode that is significantly less than 10 years.
In general we observe that the exchange of citations between different fields increases through time, ending with a large cluster including the majority of the ICL categories. The overall conclusion of this analysis of the patents data is that there is an increase in connectivity between different fields. This suggests that most technology classes are becoming less dissimilar: there is an increasing heterogeneity within fields, as they communicate with other technology fields, and thus a higher homogeneity between fields.

Conclusion
In the last decade REMs have been used to describe the drivers of dynamic network interactions. Traditional approaches focus on endogenous and exogenous drivers, which may not always capture all the heterogeneity in the data. Our aim has been to extend relational event modelling by letting the interactions depend on dynamic locations in a latent space.
Our estimation approach for the latent space relational event model combines several methods: the Expectation Maximization algorithm, Kalman filters and Generalized Additive Models. We consider the latent locations as missing states. The filter calculates their conditional expectation and the Generalized Additive Model performs the maximization: the two main ingredients of an EM algorithm. Kalman filters are effective methods for estimating latent dynamic processes. Their simplicity and intuitive usage make them suitable for many problems, commonly in engineering contexts. The filter relies on a sequence of linear operations and easily calculates the Expectation step, which is typically intractable in non-trivial cases. The dual interpretation of the Kalman filter in the Bayesian and frequentist literatures would also make a within-Gibbs implementation possible, instead of a within-EM implementation. The sequence of updates in the latent space makes the Kalman filter an effective tool for tracking the movements of the latent locations, as demonstrated in many applications. Our model formulation is very general and can encompass all the Generalized Additive Model features, such as fixed effects, random effects and smoothly time-varying effects.
The simulation results show that the model is accurate, computationally feasible and insightful under different scenarios. The patent citation analysis gives an interesting interpretation of innovation dynamics in the period 1967-2006, in which many traditionally distinct patent classes show a marked convergence in a latent knowledge space.

E. Alternative derivation of EKF
The Poisson distribution can be written in the natural exponential family formulation (McCullagh, 2018):
f(y_k | x_k) = c(y_k) exp(θ(µ)ᵀ y_k − b(θ)),
with natural parameter function θ(·) : R^{py} → R^{py}.
The advantage of writing the Poisson distribution in natural exponential family form is that the further developments are valid for any distribution of the natural exponential family: other exponential family distributions are obtained by specifying the functions θ(·) and b(·) differently. The likelihood can then be written as
L(β, Σ; y, x) = ∏_{k=1}^{n} (2π)^{−d p/2} |Σ|^{−1/2} exp(−½ (x_k − x_{k−1})ᵀ Σ^{−1} (x_k − x_{k−1})) c(y_k) exp(θᵀ y_k − b(θ)).
We obtain the correction step via maximum likelihood. The likelihood that we treat here is different from the full likelihood presented earlier: we take the single likelihood contribution at time k conditioned on the inference at the previous time point. Thus the marginal distribution of the latent process is replaced by its conditional distribution, i.e., the distribution calculated in the prediction step. The resulting log-likelihood contribution is
l_k(x_k) = θᵀ y_k − b(θ) − ½ (x_k − x̂_{k|k−1})ᵀ V_{k|k−1}^{−1} (x_k − x̂_{k|k−1}) + const, (17)
where V_{k|k−1} represents the variance of the latent process conditioned on y_1, …, y_{k−1}. From a frequentist point of view (17) is a penalized likelihood, composed of the Poisson probability of the observations and a penalty term for the latent process. In a Bayesian setting it can be considered a posterior distribution, where the penalty represents the prior distribution. The penalty/prior regulates the smoothness of the process via the covariance matrix Σ. The maximization of the posterior density is equivalent to the maximization of the penalized likelihood (Fahrmeir, 1992). We maximize this likelihood with respect to x_k to obtain x̂_{k|k}. This is not equivalent to the conditional mean, except when the posterior mode coincides with the posterior mean. That holds for the Gaussian density, which is not our case. The posterior is therefore approximated by a distribution of the same family as the prior, i.e., Gaussian; see Gamerman (1991) and Fahrmeir (1992). Thus we approximate the posterior mean with the posterior mode.
Using the chain rule, taking the derivative of the likelihood with respect to x_k and transposing it, we have
∂l_k(x_k)/∂x_k = (∂µ(x_k, β)/∂x_k)ᵀ (∂θ(µ)/∂µ) (y_k − µ(x_k)) − V_{k|k−1}^{−1} (x_k − x̂_{k|k−1}).
A first-order Taylor expansion is applied to the mean of y_k. Setting ∂l_k(x_k)/∂x_k = 0 and rearranging the terms of the equation, we have
x_k = x̂_{k|k−1} + (V_{k|k−1}^{−1} + (∂µ(x_k, β)/∂x_k)ᵀ (∂θ(µ)/∂µ) (∂µ(x_k, β)/∂x_k))^{−1} (∂µ(x_k, β)/∂x_k)ᵀ (∂θ(µ)/∂µ) (y_k − µ(x̂_{k|k−1})).
We evaluate the derivatives at x̂_{k|k−1} and use the property that the second derivative of b(θ) equals the variance of y_k | x_k. Since x_k is unknown, we approximate it with x̂_{k|k−1}. Writing R_k for this variance, we have
∂θ(µ)/∂µ |_{x̂_{k|k−1}} = R_k^{−1}.
Setting ∂µ(x_k, β)/∂x_k |_{x̂_{k|k−1}} = H_k, and considering that in the linearized model µ(x̂_{k|k−1}) = H_k x̂_{k|k−1}, we obtain the update
x̂_{k|k} = x̂_{k|k−1} + (V_{k|k−1}^{−1} + H_kᵀ R_k^{−1} H_k)^{−1} H_kᵀ R_k^{−1} (y_k − H_k x̂_{k|k−1}).
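As an illustration, the information-form correction step derived above can be sketched for the Poisson case, where the variance R_k of y_k | x_k equals the mean (function names hypothetical):

```python
import numpy as np

def ekf_poisson_correction(x_pred, V_pred, y, mu_fn, jac_fn):
    """One EKF correction step in information form for Poisson observations.

    x_pred : predicted state x_{k|k-1}
    V_pred : predicted covariance V_{k|k-1}
    y      : observed counts y_k
    mu_fn  : x -> mean vector mu(x)
    jac_fn : x -> Jacobian H = d mu / d x
    """
    mu = mu_fn(x_pred)
    H = jac_fn(x_pred)
    R_inv = np.diag(1.0 / mu)              # Poisson: Var(y|x) = mu
    V_inv = np.linalg.inv(V_pred)
    V_upd = np.linalg.inv(V_inv + H.T @ R_inv @ H)
    x_upd = x_pred + V_upd @ H.T @ R_inv @ (y - mu)
    return x_upd, V_upd

# Scalar example: mu(x) = exp(x), x_pred = 0, V_pred = 1, y = 2
x_upd, V_upd = ekf_poisson_correction(
    np.array([0.0]), np.array([[1.0]]), np.array([2.0]),
    mu_fn=np.exp, jac_fn=lambda x: np.diag(np.exp(x)))
# x_upd is [0.5]: the zero prediction plus (1 + 1)^{-1} * 1 * (2 - 1)
```

In a full filter this step would be alternated with the prediction step that produces x_{k|k-1} and V_{k|k-1} from the previous interval.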