Flexible marked spatio-temporal point processes with applications to event sequences from association football

We develop a new family of marked point processes by focusing the characteristic properties of marked Hawkes processes exclusively to the space of marks, providing the freedom to specify a different model for the occurrence times. This is possible through the decomposition of the joint distribution of marks and times that allows to separately specify the conditional distribution of marks given the filtration of the process and the current time. We develop a Bayesian framework for the inference and prediction from this family of marked point processes that can naturally accommodate process and point-specific covariate information to drive cross-excitations, offering wide flexibility and applicability in the modelling of real-world processes. The framework is used here for the modelling of in-game event sequences from association football, resulting not only in inferences about previously unquantified characteristics of game dynamics and extraction of event-specific team abilities, but also in predictions for the occurrence of events of interest, such as goals, corners or fouls in a specified interval of time.


Introduction
Football is one of the most popular team sports and is an example of an invasive sport, where two opposing teams compete for the possession of the ball with the dual objective of attacking to score a goal and defending against attacks from the opposition.Most analyses in football are typically done manually by studying video footage or using simple frequency analysis of match events.Hence, there is huge scope to improve the efficiency of the data-analytic methods as well as the quality of performance evaluation.However, the analysis of football data is mathematically and statistically challenging due to the continuous interaction between players within and across the two teams.As an introduction, we describe the event data from football and survey the existing work in this area before arguing how marked point processes are well-suited to developing a modelling foundation to achieve our goal of describing the game dynamics.

Football event data
Over the last decade, the availability of spatio-temporal data from team sports has inspired research into the application of statistical methods for team and player performance evaluation.A comprehensive survey of the recent research efforts into the spatio-temporal analysis of team sports is provided in Gudmundsson and Horton (2017).There are two primary types of spatiotemporal data collected from team sports.Movement data consists of samples of timestamped locations in the plane tracking the movement of all players and the ball during the game.Player movement is captured using fixed cameras in optical tracking systems, that process the images to obtain the trajectories.Event data streams, on the other hand, record the sequence of events that occur during the game and are collected manually by trained analysts who watch video feeds of the games through a special annotation software.As our work is motivated by the availability of event data from football, we focus on reviewing research that uses event data streams.Event data are less dense than movement data, but richer in the sense that they contain more information about what is happening in the game.Events broadly fall into two categories; player events such as passes and shots; and stoppage events such as fouls, end of game etc.Every event is annotated with a timestamp, its location, its type (pass, foul, etc.), the players involved, and team information.
A popular research topic based on event data is the network analysis of player interaction.Models for player interaction can quantify a team's playing style as well as the importance of an individual player within the team.Players are identified as nodes of a network and are connected using directed edges whose weights are proportional to the number of successful passes between the two players.Passing networks were first applied to team sports in Passos et al. (2011) to study a team's collective behaviour in water polo.Grund (2012) studied the degree centrality of passing networks in football, which quantifies the importance of nodes in the network based on the number of edges.They showed that teams that rely heavily on key players performed relatively worse.Duch et al. (2010) used flow centrality to assess player performance by capturing the fraction of times that a player intervenes in those paths that result in a shot on goal.They also take into account defensive behaviour by letting each player start a number of paths proportional to the number of balls they recover.Clemente et al. (2015) studied the density and heterogeneity of passing networks and showed that high heterogeneity leads to formation of sub-communities, meaning there is a low level of cooperation between the players of a team.Pena and Touchette (2012) looked at other centrality measures such as closeness and eigenvector centrality as well as clustering in football passing networks.
Another use of event data is in the identification of frequently occurring sequences of passes between a small group of players within the same team.In Borrie et al. (2002), passes are identified by the zones in the pitch they start and end in and frequently occurring sequences are detected by also taking into account the time intervals between passes.Wang et al. (2015) proposed an unsupervised approach to automatically detect tactical patterns in football.They present the Team Tactic Topic model based on Latent Dirichlet Allocation to identify tactics from pass sequences.Interesting visualisations are provided for the most successful tactics as well as how a team's tactical patterns evolve over a season.Van Haaren et al. (2016) also look at automatic discovery of patterns in attacking strategy.They use a data-driven approach to determine a number of spatial features about the areas occupied during a continuous possession phase of a team.The features are then used to cluster similar phases together to identify frequently occurring event sequences within the cluster.Decroos et al. (2017) partition the game using overlapping intervals to create subsequences of events to use as a feature to predict a goal event in the near future.They compute similarity between subsequences using Dynamic Time Warping, a distance measure for time-dependent sequences.
Extracting game states from event sequences to quantify the value of player actions or to make predictions of the game outcome is another interesting area of research.Routley and Schulte (2015) used Markov decision processes for valuing player actions in Ice Hockey.Game states are derived from contextual features like game score and time remaining along with the recent history of events.The associated reward for an action in the Markov decision process gives the value of the player action.A similar approach based on game states is taken in Decroos et al. (2018) to value player actions in football.They train a classification model to calculate the probability a game state will lead to a goal in the near future, where each game state is described using over 150 features.The value of a player action is then calculated by the shift in the predicted goal probability before and after the action.Other approaches for predicting goal probabilities based on a current game state are by Mackay (2017) and Robberechts et al. (2019).Approaches based on game states involve significant effort into feature engineering, and with the use of learning algorithms like gradient boosting that limit parameter interpretations, the methods provide, typically, little insight into the dynamics of the game.
The major focus of existing methods in team sport analysis appear to be tailored towards individual player performance evaluation or identifying specific patterns in team play.Most approaches take the route of summarising the event data into compact representations like networks and game states.In this paper, we take a more holistic approach to study football as a dynamic system and model the entire sequence of events within a game.Such a model, that captures all event interactions, is attractive for predicting the occurrence of the rare goal scored events, that determine the outcome of the game.

Point processes
Phenomena that are observed as a sequence of events happening over time can be represented using point processes.While point processes can describe a random collection of points in any general space, we limit ourselves to the case in which the points denote events that occur along a time axis.Such point processes, having a natural order in which the points occur, are suitable for a wide range of real-world applications and are well studied in probability theory.
As in Daley and Vere-Jones (2003, Section 6.4), processes in which points are identified only by the occurrence times are referred to as univariate point processes.Multivariate point processes, on the other hand, are those in which the realisation of a discrete random variable, say m, with a finite number of categories is recorded along with the occurrence times.Marked point processes are processes where m is allowed to be a continuous random variable.An example application of a marked point process with continuous marks is in seismology, where the magnitude of an earthquake is recorded in addition to the time of occurrence.In this paper, we model event sequences observed in football using marked point processes with discrete marks used to denote the event type.
When event sequence data are analysed using point process models, an important distinction is between empirical models and mechanistic models as noted by Diggle (2013).Empirical models have the solitary aim of describing the patterns in the observed data, while mechanistic models go beyond that and attempt to capture the underlying process that generated the data.Mechanistic models for marked point processes are typically specified using a joint conditional intensity for the occurrence times and the marks and in general are not flexible enough to be applied to complex real-world phenomena.The joint modelling of the components of the process can also be challenging and it is common to make strong restrictive assumptions like separability (González et al., 2016) to simplify the model.In this paper, we present a flexible mechanistic modelling framework for marked point processes that are suitable for a wide range of applications without the need for assumptions like separability.
We produce a family of marked point processes that generalises the classical Hawkes process, a mathematical model for self-exciting processes proposed in Hawkes (1971) that can be used to model a sequence of arrivals of some type over time, for example, earthquakes in Ogata (1998).Each arrival excites the process in the sense that the chance of a subsequent arrival increases for a period of time after the initial arrival and the excitation from previous arrivals add up.Marked Hawkes processes are typically specified using a joint conditional intensity function for the occurrence times and the marks (see, for example, Rasmussen, 2013, expression 2.2), and captures the magnitudes of all cross-excitations between the various event types as well as the rate at which these excitations decay over time.Excitation leads to clustering of events in time as the process is driven by an intensity that increases with every arrival for a short period of time.However, in applications like the event sequences observed in football, the events tend not to cluster in time and the marked Hawkes process model is not suitable.The joint modelling of the times and the marks has to be decoupled to restrict the excitation property of the process exclusively to the dimension of the marks.
Similar to the decomposition of a multivariate distribution function that motivated the partial likelihood in Cox (1975), we factorise the joint conditional distribution for a marked point process into probability density functions for each event time conditioned on the past occurrences times and marks, and probability mass functions for the event marks conditioned on the time of occurrence and the filtration of the process.Therefore, an alternative approach to specify a marked point process model is to specify the conditional distribution functions for the times and the marks separately.We derive the conditional distribution function for the marks from a marked Hawkes process, which gives us then the freedom to specify the conditional distribution for the times separately.In this way, we are able to construct marked point process models that retain the characteristic properties of Hawkes processes, such as excitation for the marks, while avoiding the strong clustering of event times.
We develop a framework for Bayesian inference of such flexible marked point processes, which is realised through the Stan (Stan Development Team, 2020) software for statistical modelling.Stan implements a variant of the Hamiltonian Monte Carlo algorithm, originally proposed by Duane et al. (1987), to generate samples from the posterior distribution of the parameters.The Bayesian models we consider are compared using the out-of-sample log predictive density.
We define marked point processes for the modelling of touch-ball events in football, which along with time and event type information also carry location information.As we illustrate, the family of marked point processes can be readily enriched to handle all times, event types and locations.We are also able to incorporate team information in a direct way that captures the relative abilities of the teams for each event type.We develop a method based on association rules (Agrawal et al., 1993) to reduce the complexity introduced by the model extensions we introduce.The rule-based approach identifies significant event interactions within sequences by placing thresholds on particular measures of significance.We then evaluate the accuracy of the excitation based models by comparing against two baseline models and confirm the superior performance of the models with excitation effects.
We provide a detailed parameter description showing how the model parameters can be used to gain valuable insights into football.The excitation component of the proposed model captures both the magnitudes and the durations of all pairwise event interactions across different locations.From the conversion rate parameters, we are able to confirm the well-known home advantage effect, and quantify the relative performance of each team when playing games at their home venue compared to away.The conversion rate parameters are also driven by team information, via the team ability parameters which, for example, can capture the relative ability of a team to convert one successful pass to another and retain possession of the ball.We also discuss how the team ability parameters can be used to obtain rankings for the teams by event type, that can be used as predictors of team performance.The team ability parameters also capture some interesting differences in the playing styles of the teams, that are not immediately apparent just by looking at the event data.In this way, the model along with its parameters Table 1: Events and their attributes from the first 20 seconds of the game between Southampton and West Ham United on September 15, 2013.For each event, we have records of the event time-stamp, team and player ids, event type, (x, y) co-ordinates of its location in the playing field, and if the event type is a Pass, the event outcome (successful/unsuccessful) and the end (x, y) co-ordinates.can be used to develop a deeper understanding of the game-play by coaching staff and inform strategic decision making.The proposed model can also be used to simulate the sequence of events in a game to obtain real-time predictions of event probabilities.The simulator results in predictions that can enhance, among others, the viewing experience of televised games.Finally, like Hawkes Processes, the proposed model also allows the recovery of the hidden branching structure of the process that quantifies the relative contributions of the background process and previous occurrences to the triggering of a new event.
The developments in this paper can be readily applied to many other team sports like rugby, hockey, basketball etc.As none of the methods have been tailored specifically to football or even sports for that matter, they can also be applied to a wide range of applications that generate event data streams.

Description and descriptives
The data that motivated this work was provided by Stratagem Technologies Ltd, and consists of all touch-ball events from all English Premier League games in the 2013/14 season.A touchball event is an event where a player has acted on the ball by touching it with some part of their body.We identified mistakes in the original data, with the most critical issues relating to impossible sequences of consecutive events (e.g. a dribble a few seconds after a goal).Such data issues have been addressed in a systematic way, using the data-cleaning workflow in the publicly-available PhD thesis by Narayanan (see, Narayanan, 2021, Section 5.3).In total, the data consists of over half a million touch-ball events recorded over the season along with other attributes.A snapshot of the data is provided in Table 1.The league is contested by a total of 20 teams and follows a round-robin tournament scheduling, where each team plays every other team at their home and away venues, which results in a total of 380 games over the season.
Each game comprises two halves that are separated by an interruption of approximately 15 minutes.In what follows, we refer to each uninterrupted game half as a game period.For each touch-ball event, we have records of the event type, time-stamp, (x, y) co-ordinates of its location in the playing field, team and unique player identifiers, game period, and if the touch-ball event is a Pass, the event outcome (successful/unsuccessful) and the end (x, y) co-ordinates .Table 2 gives the frequency of each of the 22 distinct touch-ball event types recorded in the data.
Figure 1 shows the trajectory of the ball during an attacking move that led to a goal in the 18th minute of the game between Arsenal and Norwich City on October 19, 2013.The goal Figure 1: Tracing the locations of the sequence of events that led to the goal scored by Jack Wilshere for Arsenal against Norwich City (voted the best goal of the 2013/14 season).
was scored by Jack Wilshere for Arsenal and was voted as the best goal in the English Premier League for the 2013/14 season.Latent game characteristics, such as the home advantage, and differences in playing styles and formations between the teams are also reflected in the touch-ball events.For example, Figure 2 compares the concentration of ball touches for Arsenal and Chelsea between their home and away games in the 2013/14 season.The playing field is plotted so that the team is always attacking to the right.It is clear that when the teams play at home the density of events is higher towards the opponent's goal.In fact, the point process modelling framework developed in this paper allows us to quantify home advantage by, for example, learning the relative ability of each team to retain possession when playing at home compared to away (see Section 6.6).

Data preparation
We combine the types and outcomes of touch-ball events into in-play and terminal composite events.The in-play composite events are Win, Dribble, Successful Pass (Pass S), Unsuccessful Passes are deemed to be successful when the ball is received by a teammate and unsuccessful otherwise.Shots include all attempts on the opponents' goal, including those missing the target.The Keeper event denotes the goal keeper taking possession of the ball into their hands by picking it up or claiming a cross.The Keeper event is unlike any other in-play event, as the goal keeper is allowed to hold the ball without being challenged for a period of time while waiting for opponents to clear the goal area.As a result, there is often a delay before the next event even though the ball is technically in-play.Saves are events where the goal keeper prevents a shot from crossing the goal line.Clear events are those where a player moves the ball away from their goal area to safety while the Lose event is when a player loses possession of the ball.The terminal composite events are those which result in the ball going out-of-play and are Goal, Foul, Out Throw, Out GK, Out Corner and Offside Pass (Pass O).The terminal events interrupt the game resulting in a delay before play resumes.Each composite event is tracked for both the home and away teams.For this reason, we append "Home" or "Away" as a prefix to the event label to distinguish between the events of the two teams playing the game.This results in M = 30 distinct composite events, whose labels and observed frequencies are given in Table 3.
For each touch-ball event, the data also contains the associated (x, y) coordinates on the playing field.We partition the playing field into 3 zones of equal area.The zones and their corresponding labels are shown in Figure 3. Zone 1 is the region where the home team defends their goal, zone 3 is the region where the home team attacks, and zone 2 is the midfield region.It is natural to expect that the control a team has on the game depends on the zone the ball is at.For example, the home team is expected to retain possession of the ball more successfully

Zones
Figure 3: Mapping from event location in (x, y) coordinates to zones.
in zone 1 as compared to zone 3. Table 4 shows a snapshot of the data after its preparation, including a unique identifier for each game period in the data.
3 Marked point processes

Conditional intensity function
Sequences of events over time are conveniently represented as realisations of a point process.Oftentimes, the events can carry additional information, which are assumed to be realisations of random variables, referred to as marks.The collection of the times {t i } at which the events occur and the marks {m i } is a marked point process, whose ground process, is the process for {t i } only.
A marked point process is typically specified through its joint conditional intensity function where λ * g (t) is the conditional intensity of the ground process and f * (m | t) is the conditional probability density or mass function of the mark m at time t.Both λ * g (t) and f * (m | t) in (1) are understood as being conditional on F t − , which is the filtration of the marked point process up to but not including t.

Marked Hawkes processes
Marked Hawkes processes are point processes whose defining characteristic is that they selfexcite, meaning that each arrival increases the rate of future arrivals for a period of time.More formally, consider a realisation of a marked point process, consisting of event times {t i } with t i ∈ R + and t i > t i−1 , and marks m i ∈ {1, . . ., M } (i = 1, . . ., n), where M is the number of discrete marks.The marked Hawkes process is most intuitively specified using its mark dependent conditional intensity function λ * (t, m), which for an exponentially decaying intensity is (Rasmussen, 2013, expression 2.2), In ( 2), the parameter µ > 0 is a constant background intensity and δ m ∈ (0, 1) is the background mark probability for mark m with M m=1 δ m = 1.The parameter ∈ (0, 1) is the excitation factor, β > 0 is the exponential decay rate and γ m j →m ∈ (0, 1) is the probability the excitation from an event of mark m j triggers an event of mark m, with M m=1 γ m j →m = 1 for any m j ∈ {1, . . ., M }.

Limitations of the marked Hawkes process model
The specification in (2) describes a marked Hawkes process that is linear in the sense that the excitations from different arrivals add up, not only increasing the probability of triggering an event of a particular type, but also concentrating the occurrence times for a certain period of time.For this reason, marked Hawkes processes have proven useful in a wide range of applications, where events tend to cluster in time, such as the modelling of earthquakes (Ogata, 1998), gang violence (Mohler et al., 2011) and financial market events (Bowsher, 2007).
However, in applications like the modelling of event sequences in football, each event triggers other events of a particular type with high probability, while it is not necessarily true that the occurrence times cluster.
As an illustration that the observed events in football do not exhibit clustering, consider only the collection of the times {t i } at which the events occur.A succinct method to investigate the aggregation of the points is using the non-parametric Ripley's K-function summary (Ripley, 1977), which is the reduced second moment measure.An estimator of the K-function in the one-dimensional case is derived in Diggle (1985) as where (0, T ) is the time interval over which the n points are observed, 1(.) is the indicator function, and w ij is an edge correction taking values

P oisson process Hawkes process Gamma process Observed t imes
Figure 5: Comparing the cumulative distribution functions (CDFs) of the inter-arrival times of events simulated from a Poisson process (green), a Hawkes process (orange), a Gamma process (purple) and observed event times in football (pink).Empirical CDFs were computed using 10, 000 inter-arrival times in each case.
For a homogeneous Poisson process, K(t) = 2t.If K(t) > 2t, the process is said to be over-dispersed relative to the Poisson, and exhibits some degree of clustering.If K(t) < 2t, the process is under-dispersed relative to the Poisson, and tends towards regular occurrences.Figure 4 shows the K function estimate, K(t) − 2t for t ∈ {1, 2, . . ., 100}, of the observed event times from the first game of the season between Aston Villa and Arsenal (n = 1279).We also compare the estimates from those observed times with the estimates of the events simulated from several one-dimensional Hawkes processes with conditional intensity of the form, λ * (t) = µ + t j <t βe −β(t−t j ) .Hawkes I (green) is the fitted Hawkes process with parameters (µ, , β) = (0.4183, 0.0035, 0.0004) estimated from the observed times using maximum likelihood.Note that the estimated is very close to 0, indicating no clustering.A Hawkes process with = 0 is the trivial case with no excitation that reduces to a Poisson process (orange) with estimated rate µ = 0.4189.Hawkes II (pink) has parameters (µ, , β) = (0.2594, 0.4, 0.01) and Hawkes III (purple) has parameters (µ, , β) = (0.1068, 0.8, 0.01).Hawkes II and Hawkes III are examples of processes with moderate and severe clustering respectively, whose µ parameters were estimated from the observed times using maximum likelihood after fixing , β.The box plots of the estimates for the Hawkes and Poisson processes were computed using 100 independent simulations of each process over the same time interval as the observed times.
The K(t) − 2t values for the Hawkes II and Hawkes III processes quickly get above 0 and increase with t demonstrating the behaviour of processes with different degrees of clustering.On the other hand, the K(t) − 2t values for the fitted Hawkes process (Hawkes I) and the Poisson process concentrate around 0, being indicative of the expected behaviour of processes where points do not cluster.The K(t) − 2t values from the observed times range from -1.4 to -0.9 indicating slight under-dispersion relative to the Poisson process.In other words, the observed times in football exhibit no clustering and in fact show evidence of being more regular than the Poisson process.
Another method to investigate the aggregation of points is by looking at distribution of the inter-arrival times.Figure 5 shows the empirical distribution function of the first 10, 000 inter-arrival times from the first 7 games of the league season.We also plot the cumulative distribution functions of a homogeneous Poisson process and a Hawkes process, whose parameters are estimated from the aforementioned 10, 000 events using maximum likelihood.The fitted Hawkes process is far from the empirical distribution function, and almost identical to the fitted Poisson process confirming that the arrival times in football do not cluster.This is further evidence that Hawkes processes are not appropriate for modelling events such as those observed in football, that tend not to cluster in time.Figure 5 also includes the cumulative distribution function of a fitted Gamma process, which, as is apparent provides an excellent fit to the observed inter-arrival times.
4 Specification of flexible marked point processes

Decoupling the modelling of times and marks
According to the decomposition of a multivariate distribution function in Cox (1975, expression 2) the likelihood of a marked point process observed in (0, T ) can always be factorised as where g, G, and f are the conditional density and distribution function for the times, and the probability mass function for the marks, respectively, and ζ, θ are unknown parameter vectors that may or may not share components.The last term in (4) accounts for the fact that the unobserved occurrence time t n+1 must be after the end of the observation interval (0, T ).Therefore, an alternative approach to specify a marked point process is to specify the functions g( In this way, we can restrict the characteristic excitation property of marked Hawkes processes exclusively to the modelling of the marks, and have the freedom to specify a different model for the occurrence times.By the definition of the conditional intensity function for a marked point process in (1), where α * = β µ .Expression (5) makes it immediately apparent, that the parameters µ and of the marked Hawkes process as specified by (2) are not always identifiable for general specifications of g(• | F t i−1 ; ζ) in (4).Apart from a mathematical fact, this is also rather intuitive, because µ and in (2) characterise the evolution of the Hawkes process in the time dimension and the sequence of marks is not sufficient to identify them.
The specification of the marked point process likelihood is complete once a probability density function g(• | F t i−1 ; ζ) for the event times is specified.
We should highlight here that the marked point processes from the factorisation in (4) are generally different to the ones that result by assuming separability of the conditional intensity functions (see, for example, González et al., 2016, Section 6.5).A separable conditional intensity functions has the form λ * (t, m) = λ * g (t)f * (m) .and implies that the conditional distribution of the mark does not depend on the occurrence time t.Separability is a convenient assumption because it allows for the sequence of marks to be modelled separately from the sequence of times.In contrast, the factorisation in (4) allows the conditional distribution of the mark to depend on the time of occurrence as well as the history, still allowing for estimating θ separately from ζ, if θ and ζ do not share components.
The proposed marked point process model also allows the recovery of the hidden branching structure of the process, a key feature of Hawkes Processes (Hawkes and Oakes, 1974).In Section 6.8, we calculate the branching structure probabilities and quantify the relative contributions of the background process and previous occurrences to the triggering of a new event.

Parameter interpretation
In (5), the mark probability of each event in the sequence is determined by a combined additive effect from a background component and all previous occurrences.The first term δ m i in the numerator is the mark probability associated with the background component, while each term α * e −β(t i −t j ) γ m j →m i is the contribution from the excitation caused by a previous occurrence in the sequence.
The background mark probability δ m ∈ (0, 1) is the probability an event has a mark m if the event is triggered solely by the background component.The excitation factor α * ≥ 0 is a scaling factor applied to the contributions from the previous occurrences to the event mark probability.Large values of α * indicate a stronger dependence of the process on its history, because the contributions from previous occurrences are weighted higher relative to the background component.The decay rate β > 0 is the exponential rate at which the excitations from previous occurrences decay over time.The parameter γ m j →m i ∈ (0, 1) is the probability the excitation from an event of mark m j triggers an event of mark m i .In other words, γ m j →m i can be viewed as the conversion rate for the transition from an event with mark m j to an event with mark m i .
In summary, as in marked Hawkes processes, the specification for the marks in (5) captures not only all cross-excitations between the various marks but also the rate at which these excitations decay over time.

Covariate-driven cross excitation
The conditional distribution of marks with probability mass function (5), allows to drive the cross-excitation of the marks using covariates.The conversion rates γ m j →m can be linked to a covariate vector x = (x 1 , . . ., x p ) observed at the current time through the baseline-category logit specification (see, for example, Agresti, 2007, Section 6.1) where ω m is an unknown p-vector of regression parameters.Then, keeping all covariates apart from x t fixed, ω mt is the log of the ratio of odds for category m versus the baseline category M at x t + 1 to that at x t (t = 1, . . ., p).Also, by setting all covariates x t equal to 0, φ m j →m i is the log of the ratio of odds for category m versus the baseline category M .The covariate vector x can include a combination of process-specific covariates that are time-invariant, and event-specific covariates.For example, in Section 5.2, we use (6) to parameterise the marked point process in terms of the relative abilities of teams for each event type, and produce team rankings per event-type.

Spatio-temporal marked point processes
We can readily extend the factorisation of the likelihood in (4) to include conditional densities for the event locations, when the latter are observed.We can write where {z i } is the collection of random variables corresponding to the spatial component of the process, which is characterised by the conditional probability mass or density functions h(• | t i , F t i−1 ; η) with η being a parameter vector that may or may not share parameters with ζ and θ, and ψ = (ζ , η , θ ) .The filtration F t i now includes all times, marks and locations up to time t i .
If the process ends at the last occurrence time t n , then the last term 1 − G(T | F tn ; ζ) in ( 4) and ( 7) is not part of the likelihood (see, for example Lindqvist, 2006, Section 4.2).This is the case in the modelling of touch-ball event sequences in football we consider here, where the process ends with or immediately after the last event observed in each half of the game.
5 Bayesian modelling of in-game event sequences

Preamble
The framework for specifying flexible marked point processes of Section 4 is rather attractive for the modelling of in-game event sequences in football and other team sports.Firstly, crossexcitation of in-game events is a natural assumption because any event in an event sequence is likely to be triggered by one or more of the previous events.For example, following a corner kick, the next event is with high probability one among a shot on goal, a defensive clearance or a claim by the keeper.Such effects can be naturally and readily captured by the parameters of the conditional mark distribution in (5), which involves not only the magnitudes of all crossexcitations between the various event types but also the rate at which these excitations decay over time.Secondly, the preliminary analyses in Section 3.3 provides strong evidence that occurrence times do not necessarily cluster, as off-the-shelf marked Hawkes processes imply.Hence, the freedom to use a more flexible conditional distribution for the occurrence times, such as a Gamma process, is a rather attractive prospect.Furthermore, as discussed in Section 4.3, team information can be incorporated in the model in a direct way as covariate information to drive the cross-excitation based on the relative abilities of the teams.
Overall, as we demonstrate later, the framework of Section 4 can be used to provide valuable explanatory tools into the underlying dynamics of the game for the coaching staff and inform strategic decision making.It can also produce predictions of events, such as goals, in a specified time horizon, and of game outcomes that can enhance, among other things, the viewing experience in televised games.

Excitation-based models
Assume that the touch-ball events in S game periods are S realisations of independent spatiotemporal point processes, with the sth realisation involving n s events.Denote by t si , m si , and z si the occurrence time, mark and location of the i-th event in the sth realisation, respectively.Each of the S independent spatio-temporal marked point processes has a likelihood as in (7) after dropping the last term, and with conditional probability mass function for the marks as in (5).The product of the S likelihoods is the overall likelihood based on the S game periods.
The probability density functions for the occurrence times within each period are set to where F st si denotes the filtration of the sth process up to time t si .By this specification, the time to next event is modelled using a gamma distribution with shape and rate parameters that are specific to the mark of the last observed event.In this way, we wish to capture the differences in the expected time to the next event across the different event types.For example, we expect a shorter time to the next event after an in-play event like a Pass, compared to that of an out-of-play event like a Foul.Even within the group of out-of-play events, we expect a shorter delay following a Throw-in as compared to a Goal event.
For a discrete set of locations {1, . . ., Z}, the conditional probability mass function for the current location is set to where η (z si−1 ,m si−1 )→z si is the probability of transitioning into location z si given the location z si−1 and the mark m si−1 of the last observed event.Expression (9) models the sequence of locations as a discrete first-order Markov chain (see, for example, Norris, 1997) with a transition probability matrix η.The current state of the Markov chain is determined by the combination of the location and the mark of the last observed event, and the probability of transitioning into the next location depends only on the current state.The state space of the Markov chain is given by the Cartesian product {1, . . ., Z} × {1, . . ., M }.
We consider four alternative parameterisations for the conditional probability mass function for the marks.The Sβ (scalar β) spatio-temporal marked point process results from ( 8), ( 9) and a conditional mark distribution of the form (5), that is where α = log(α * ).The Vβ (vector β) model results from (8), (9) and where β m is the exponential decay rate of the excitation caused by an event of mark m.Vβ allows the decay rates to depend on the mark of the event causing the excitation.Hence, Sβ is formally nested in Vβ and results when β = β 1 = . . .= β M .The Mβ (matrix β) process results from ( 8), ( 9) and where β m→m |z is the decay rate of the excitation caused by an event of mark m on an event of mark m at location z.Specification (12) allows the decay rates to vary both with the pair of marks involved in the excitation and across locations, and allows the background mark probabilities δ and event conversion rates γ to vary across location.The Mβ model can be used to account for scenarios like those where a Corner event excites a Pass event in the short term and a Shot event in the longer term (β Corner→Pass|3 > β Corner→Shot|3 ).It also allows us to capture effects such as how a team is more likely to make more passes and retain possession of the ball in the defensive zone, while attempting more shots on goal in the attacking zone (γ Pass→Pass|1 > γ Pass→Pass|3 and γ Pass→Shot|3 > γ Pass→Shot|2 ).The final model we consider is the MβA (matrix β with abilities) where the baseline category logits of the conversion rates in ( 6) are driven by team information as In the above expression, φ m→m |z is a location-dependent baseline conversion, and c is the team in possession of the ball attempting the event conversion.The parameter ω cm , then, reflects the ability of a team to complete a conversion to an event of mark m.

Prior distributions
The shape and rate parameters of the Gamma distributions for the inter-arrival times in (8) are assigned independent exponential priors with rates a and b , respectively.The probability mass function for the locations specified in ( 9

Posterior distributions
The time and location conditional distributions corresponding to (8) and ( 9) share no parameters with each other, and no parameters with any of the conditional mark distributions for the Sβ, Vβ, Mβ, and MβA models.Furthermore, the likelihood in ( 7) can be factorised into a term depending only on the time parameters ζ, a term depending only on the location parameters η and a term depending only on the mark parameters θ.Given that the priors for ζ, η and θ are also independent, the derivation of, or sampling from, the posterior distributions can be performed separately for each of those parameters.
The priors for the location parameters η are conjugate, so the posterior for η is readily obtained.If y = {y i→j }, for j ∈ {1, . . ., Z}, are the observed counts of transitions originating from the state i where i ∈ {1, . . ., Z} × {1, . . ., M }, then the posterior distribution of each row of the transition matrix η i is a Dirichlet distribution with concentration parameters (y i1 + ν, . . ., y iZ + ν).
Posterior sampling for the parameter vectors a, b in (8) of the conditional distributions for the times, and the parameters θ of the conditional distributions for the marks in each of the Sβ, Vβ, Mβ, and MβA models is carried out using the variant of the Hamiltonian Monte Carlo procedure (Duane et al., 1987) that is implemented in Stan (Stan Development Team, 2020).
We have also implemented posterior sampling using a Metropolis-within-Gibbs procedure, which, though, proved to mix poorly in artificial data sets for the Sβ, Vβ, Mβ, and MβA, rendering it computationally infeasible.As in the case of Hawkes process, the poor mixing stems from the presence of strong correlations between the model parameters as well as the flatness of the likelihood function (Veen and Schoenberg, 2008).Stan, on the other hand, implements the No-U-Turn Sampler (Hoffman and Gelman, 2014) that automatically calibrates tuning parameters in a warm-up phase and can efficiently sample from complex posterior distributions.

Model complexity
The conditional mark distribution in the Mβ and MβA models involves a large number of parameters.There are M 2 Z decay rate parameters and M (M − 1)Z baseline conversion rate parameters which makes posterior sampling a computationally challenging task.We have developed a screening procedure based on association rule learning (see, for example, Agrawal et al., 1993) that operates on the data involved in the likelihood and eliminates parameters prior to posterior sampling.
The screening procedure retains only those parameters that capture the most significant event interactions and depends on two constants that need to be chosen.The first is the window size W for the number of transient events, and any event is allowed to be triggered by only one of the W events leading up to it.The other is the number of event pairs N considered in each of the three zones and sets a threshold on the number of significant event interactions that are identified.Full details on the association rule based screening procedure are given in Section S2 of the Supporting Materials.

Model evaluation
Let X (train) be the set of the training data on which the likelihood is based on, consisting of n (train) events, and let ψ (1) , . . ., ψ (R) be R samples from the posterior distribution.Denote by X (test) the set of held-out test data, consisting of n (test) events.
One method to evaluate the predictive accuracy of each model is to use the log point-wise predictive density (Vehtari et al., 2017) computed on the test data, using the posterior samples where ) is the likelihood of (t, z, m) given the filtration F t − at the posterior sample ζ (r) , η (r) , θ (r) .Large values of lpd indicate better predictive accuracy.Apart from Sβ, Vβ, Mβ, and MβA, we also evaluate the predictive accuracy of two simpler baseline models that do not include Hawkes-like excitation effects.The first baseline model, termed FOMC model, is based on the factorisation of the likelihood of marked spatio-temporal processes in expression (7) with models for the times and locations as in ( 8) and ( 9), respectively, but with the conditional probability mass function for the marks being a first-order Markov chain In this specification, θ (z,m)→m is the probability of the event mark m given that last observed event has location z and mark m.The second baseline model, termed MSTHP, is the marked where ρ mz is the Poisson rate parameter and q mz is the number of event occurrences for mark m at location z over a total observation time T in the data.The FOMC and MSTHP models have conjugate prior distributions and therefore their posteriors are readily obtained.Details on those prior distributions and the derivation of their posterior distributions are given in Section S1 of the Supporting Materials.

Training
Samples from the posterior distributions for the parameters of the Sβ, Vβ, Mβ, and MβA models of Section 5.2, and of the FOMC and MSTHP baseline models of Section 5.6 are obtained using all event sequences from the first 20 games of the league season, played between 17/08/2013 and 26/08/2013, which constitute X (train) .The training data involves S = 40 game periods involving of 27, 660 events.Each of the 20 teams participating in the league plays in two of the 20 games, one at their home and one at their away venue.As is also described in Section 2.1, there are M = 30 marks and Z = 3 zones.Table 5 gives the zone-wise event frequencies across marks in the training data used for modelling, reflecting the large variability in the frequencies both across marks and zones.
For the Mβ and MβA models, the association rule learning screening procedure of Section 5.5 is employed for all combinations of W ∈ {5, 10} and N ∈ {50, 100} to eliminate some of the model parameters and reduce the model complexity before posterior sampling.
The hyper-parameters for the prior distributions specified in Section 5.3 are as follows.The hyper-parameters a and b for the exponential priors on the parameters of the Gamma distributions in (8) are both set to 0.01.The Dirichlet prior on the background mark probabilities δ has concentration hyper-parameter δ = 1.The exponential prior on the decay rates β has a rate hyper-parameter β = 0.1.The Normal priors on the excitation factor α and the baseline-category logit model parameters φ and ω have hyper-parameters σ α , σ γ = 10.The location-specific background mark probability vectors in the Mβ and MβA models are assigned independent Dirichlet priors with concentration hyper-parameter δ = 1.
The ability parameters in the baseline category logit specification for the conversion rates of the MβA are not directly identifiable.In order to make them so, we set the abilities ω cm for West Ham United to 0 (m = 1, . . ., 30).Then, ω cm > 0 indicates that for team c, a previous event is more likely to trigger an event of mark m when compared to the reference team.
Samples from the posterior distributions are obtained by running four parallel chains using the Hamiltonian Monte Carlo procedures implemented in Stan.The Stan templates we used are all provided in the Supporting Materials.Each chain is initialised with different starting values and run for a total of 500 iterations post the warm-up phase.Table 6 gives posterior summaries along with convergence diagnostics for some of the parameters of the MβA model with W = 5 and N = 100; the corresponding chain-wise trace plots are provided in the Supporting Materials.
The convergence of the algorithm is assessed using the potential scale reduction factor R proposed by Gelman et al. (1992), which is the ratio of the average variance within each chain to the variance of the aggregated samples across chains.If the chains have converged to the stationary distribution, the expected value of R is 1.All parameters have R < 1.1, which, as recommended in Gelman et al. (1992), is evidence for convergence.Table 6 also gives the effective sample size (see, for example, Gelman et al., 2013, Section 11.5) for the samples for each of the posterior marginals, which indicate that the sampler returned samples with acceptable autocorrelation.For some parameters the effective sample size is larger than the sample size due to negative autocorrelations.This, as pointed out in Vehtari et al. (2021), is a consequence of the HMC algorithm used in Stan being an antithetic Markov chain which has negative autocorrelations on odd lags.The impact of the prior distributions in Section 5.3 on the posterior samples is minimal as seen, for a selection of parameters, in Figure 6 indicating that the posterior distributions of the parameters have concentrated after accounting for the likelihood.

Model evaluation
The Sβ, Vβ, Mβ, and MβA models are compared with each other and with the FOMC and MSTHP baseline models of Section 5.6 in terms of their log point-wise predictive density ( 14) computed on test data.The test data X (test) includes all events from the 5 games immediately following the games in the training data, played between 31/08/2013 and 01/09/2013.The test data involves S = 10 game periods involving of 27, 660 events.
Table 7 gives the log predictive densities lpd, summed over all the events in the 10 game periods in the test data.

Background mark probabilities
Table 8 gives the posterior means of the background probabilities δ m|z for all marks m ∈ {1, . . ., 30} and all locations z ∈ {1, 2, 3}.The background mark probabilities for the home and away team events in zone 1 are almost equal to those in zone 3 for the away and home team events, respectively.This is as expected because the attacking zone for the home team is the defensive zone for the away team and vice-versa.The similar probabilities for the home and away background mark probabilities could indicate that the background process of the game is not influenced by home advantage.To Table 7: Cumulative log posterior densities lpd over 10 game periods in the test data for all fitted models along with the number of estimated parameters d (par) in each model.For the Mβ models, W is the number of transient events and N is the number of significant event pairs identified in the rule-based framework for reducing model complexity.The formal method to test our hypothesis is to calculate the Bayes factor, defined as the ratio of the marginal likelihood of the constrained MβA model to the marginal likelihood of the full MβA model.Then a Bayes factor greater than 1 would indicate that there is no evidence in favour of the full MβA model and therefore, the background mark probabilities do not capture home advantage.However, as a consequence of both MβA models being highdimensional (∼ 1500 parameters), calculating their marginal likelihoods proved computationally infeasible.
As an alternative, for the constrained MβA model, we calculate its out-of-sample log predictive density on the same test data as carried out for all the other fitted models in Section 6.2.In fact, the constrained MβA model ( lpd = −21589.28)turns out with better predictive performance than the full MβA model ( lpd = −21599.81),supporting our claim that the background process of the game is not influenced by home advantage.
We also observe that the successful Pass events account for the majority of the background probability mass, while events like Shots and Goals have nearly zero probability.This suggests that the Shot and Goal events are highly unlikely to originate solely from the background component, but are instead triggered by excitations from previous events.

Excitation factor
The excitation factor α in expression ( 12) is a scaling factor applied to the contributions from the previous occurrences to the event mark probability.In ( 12), the background component has a weight of 1, while previous occurrences are weighted by exp(α).
The 95% highest posterior density interval for exp(α) is (451.35, 642.54), providing evidence that the contributions from previous occurrences carry substantially higher weight relative to the background component.In other words, this indicates that event sequences in football have a significant dependence on their history.

Decay rates
As mentioned in Section 4.2 the decay rate β m→m |z in expression ( 12) is the exponential decay rate of the excitation caused by an event of mark m on an event of mark m at location z.
By allowing the decay rates to depend on the pair of marks involved in the excitation, we had hoped to account for scenarios like a Corner event exciting a Pass S event in the short term and a Shot event in the longer term.Indeed, the 95% highest posterior density interval for β Home Corner→Home Pass S|3 is (1.34, 2.36) and β Home Corner→Home Shot|3 is (0.16, 0.44) illustrating that the Corner → Shot excitation decays at a much slower rate compared to the Corner → Pass S excitation.

Conversion rates
The parameter γ m→m |z in expression ( 12) is the probability the excitation from an event of mark m triggers an event of mark m at location z.Table 9 gives the posterior means and standard deviations of γ m→m |z in the midfield region (z = 2) for Manchester United.The probabilities for the Home Win → Home Pass S, Home Dribble → Home Pass S and Home Pass S → Home Pass S conversions are higher compared to their away team counterparts, indicating that Manchester United is better in retaining possession of the ball when playing at home compared to away.
Figure 7 provides a ridge-line plot of the log odds ratio for home versus away ability of a team to convert a Win → Pass S (Figure 7a) and Pass S → Pass S (Figure 7b).The teams are listed in decreasing order of the means of their respective posterior log odds ratios which are indicated by vertical lines.The percentage values by each plot, indicate the fraction of the Table 9: Posterior means and standard deviations (in parenthesis) of the event conversion probabilities γ m j →m i |z i corresponding to the location z = 2 from the MβA model for Manchester united.The γ m→m |z 's are computed by setting the team identifier c = 11, (corresponding to Manchester United), for both the home as well as the away events.The highlighted cells (in bold) illustrate the superior ability of Manchester United when playing at home compared to away.The dots (•) denote means and standard deviations less than 0.005.distribution greater than 0. All but two teams in (Figure 7a) and five teams in (Figure 7b) have greater than 50% of their distribution greater than 0, confirming that the vast majority of teams possess a higher ability to retain possession while playing at home.In this way, we not only confirm the well-known home advantage effect, but also quantify team performance for games played at home as well as away.

Team abilities
Figure 8 provides a ridge-line plot of the posterior distribution of the parameters ω c,Home Pass S and ω c,Away Pass S .The teams are listed in the decreasing order of the means of their respective posterior distributions which are indicated by vertical lines.We observe that Manchester United, the team with the highest ability to retain possession in home games (Figure 8a), drop significantly down in the rankings for the away games (Figure 8b).This is evidence that when Manchester United plays away they seem to deviate from the possession-based strategy they  Figure 9a provides a ridge-line plot of the posterior distribution of the cumulative ability of a team to attempt a shot on goal.A higher ω c,Home Shot , for example, indicates that for the team c, an event like Home Pass S is more likely to trigger a Home Shot.We do not expect the cumulative abilities ω c,Home Shot + ω c,Away Shot of the dominant teams to be high, as they might prefer to make additional passes to create better goal scoring opportunities.A weaker team, on the other hand, typically has fewer opportunities to attack and therefore, is more likely to attempt a shot on goal when possible.Indeed, this is what we observe in Figure 9, where we compare the team rankings based on their cumulative ability ω c,Home Shot + ω c,Away Shot with the number of shots per pass completed in the attacking third (S/P column in Figure 9b) in the training data.The comparison between Cardiff City and Norwich City is an interesting example of two teams that appear to be similar with 18 and 19 shots on goal attempted, respectively, in their two games in the training data.However, the two teams are at the opposite ends of the ranking based on their cumulative ability ω c,Home Shot +ω c,Away Shot , capturing the clear difference between their attacking styles.
Table 10a shows the team rankings based on the cumulative ability to trigger five different event types.For example, the Pass column ranks teams in the decreasing order of their posterior means of ω c,Home Pass S +ω c,Away Pass S .The teams are ordered in Table 10a by  the teams finished in the final league table of the 2013/14 season in Table 10b.

Event genealogy
The branching structure u si indicates whether the ith event in sth sequence is an "immigrant" (u si = 0) or an "offspring" of a previous event with index j (u si = j).Given an observed event sequence F stsn s , the conditional branching structure probabilities P(u si | F st si ) based on the model specification in expression (12) are The branching structure probabilities in (17) quantify the relative contributions of the background process and previous occurrences in the mark probability of the ith event in the sth sequence.Figure 10 shows the posterior means of the branching structure probabilities for all events in the first four minutes of the game between Chelsea and Hull City on 18/08/2013.To illustrate the flexibility of the model to account for dependence between events over arbitrary durations of time, we highlight the event Home Shot showing a higher probability of being an offspring of the event Home Out Corner than being an offspring of the more recent Home Pass S event.(a)

Model-based predictions
Finally, we illustrate how the mechanistic modelling framework presented in this paper can be used to simulate event sequences in football and obtain predictions of event probabilities in realtime.We split the game between Arsenal and Tottenham Hotspur (01/09/2013) in the test data into 30-second intervals.For each interval, given the history of events up to but not including the interval, we simulate events over the next 30 seconds Q = 100 times for each of the R = 500 posterior samples from the MβA model with the tuning parameter setting (W = 5, N = 100).
In Figure 11, we plot the proportion of all simulations within each interval where at least one Home Shot event was simulated, and use dotted lines to denote the intervals where a Home Shot event was actually observed.We also include a 10-step moving average model (MA 10) as a benchmark for comparison.We excluded the first 5 minutes of the game to ensure that we have can be tailored to separately model the components of the events in football, namely, the times, the locations and the event types.We were also able to incorporate team information into the model in a direct way that captured the relative abilities of the teams for each event type.We developed a method based on association rules to reduce the increased model complexity introduced by model extensions.The rule-based approach identified significant event interactions within sequences by placing thresholds on measures of significance.We then evaluated the accuracy of the excitation based models by comparing against two baseline models and confirmed the superior performance of the models with excitation effects.
We provided a detailed parameter description showing how the model parameters can be used to gain valuable insight into football.The excitation framework of the best performing model captured both the magnitudes and the durations of all pairwise event interactions across different locations.From the conversion rate parameters, we were able to quantify the well-known home advantage effect.We also discussed how the team ability parameters can be used to obtain rankings for the teams by event type, that can be used as predictors for team performance.The team ability parameters also captured some interesting differences in the playing styles of the teams, that were not immediately apparent just by looking at the data.In this way, the model along with its parameters can be used to develop a deeper understanding of the game-play by the coaching staff and inform strategic decision making.The proposed model can also be used to simulate the sequence of events in a game to obtain real-time predictions of event probabilities.We believe these predictions would enhance, among other aspects, the viewing experience of televised games.
The dataset we used consists of events from a single English Premier League season, which has a total of 380 games.However, as described in Section 6.1, we only used the first 20 games of the season as training data for the modelling exercise, over which every team in the league plays exactly one game each at Home and Away venues.This represents the minimum number of games required to ensure identifiability of all model parameters, specifically the team abilities.Even though the volume of data was kept to the minimum for computational reasons, our results illustrate that the model can provide valuable insights with limited data.So, the methodology developed in this paper can be readily applied to other team sports like rugby, hockey, basketball, American football etc where there may be fewer events per game or fewer games in a season.
Multiple seasons can be modelled together as if it were just one season using the modelling framework we propose, as long as the game rules, and hence, the definition of the events being considered does not change.Another aspect of the tournament to note is the relegation and promotion of teams within the league, which results in some teams not playing the same number of games over multiple seasons.A limitation of the proposed model is that the game periods are exchangeable, because the likelihood is invariant to the order in which the game periods and the games occur.It would be more natural to allow for the team ability parameters in (6) to be time-varying, especially over multiple seasons during which team players and managers are likely to change.Due to computational reasons we were not able to utilise most of the data even within a single season and current work focuses on overcoming this computational barrier using variational inference (Blei et al., 2017).
As none of the methods have been tailored specifically to football or even sports for that matter, they can be applied to a wide range of applications that generate event data streams.Specifically, the conversion rate parameters can be used to capture the triggering structure between different event types, for example, the probability of large earthquakes triggering smaller aftershocks.Also, the team ability parameters can be used for other multi-agent environments, for example, accidents by car type, countries in the analyses of financial events, individuals in identity systems, and so on.

Supporting materials
The raw data that motivated this work has been provided by Stratagem Technologies Ltd and consists of all touch-ball events for the 2013/14 season of the English Premier League.The raw data cannot be disclosed as the authors do not have the license to do so.The GitHub repository https://github.com/ForeStats/flexible-msttp-footballprovides all the computer code used for the data pre-processing and the Stan templates used to carry out the analyses presented in this paper.We also provide guidance on how the code can be used to apply our methods to similar data sets, like the publicly-available 2020/21 FA Women's Super League Data provided by StatsBomb Inc. at https://github.com/statsbomb/open-data.Section S1 of the Supporting Materials document provides details on the prior distributions and the derivation of their posterior distributions for the FOMC and MSTHP models.Section S2 gives details on the association rule based screening procedure used to identify the most significant event interactions.We also provide the chain-wise trace plots for some of the parameters of the MβA model with W = 5 and N = 100.Table S3: Posterior means of the homogeneous Poisson rates ρ m,z , for the first 5 marks.

S1.3 Baseline Markov chain model for the marks
The probability mass function for the marks specified in expression (15) of the main text, models the marks as a multinomial distribution given the current state (defined by the current location and the mark of the last observed event).Each row of the transition probability matrix θ, corresponding to a single state, is a set of multinomial parameters, one for each mark, that add up to 1. Similar to the model for locations in Section S1.1, let c = {c i→j }, for j ∈ {1, . . ., M }, be the count of observations of the transitions from the state i where i ∈ {1, . . ., M } × {1, . . ., Z}.Table S4 gives the observed counts of transitions from the first 5 states in the training data.
The likelihood of c given the multinomial parameters θ is where the sum of the probabilities, M j=1 θ i→j = 1.The conjugate prior for the multinomial distribution is the Dirichlet distribution, , where u i > 0 are the hyperparameters.The posterior distribution of θ i is therefore a Dirichlet with parameters u i + c i .We set u i to 1 and the resulting posterior means of the parameters θ i→j corresponding to the first 5 states are given in Table S5.

S2 Dealing with model complexity
The conditional mark distribution in the Mβ and MβA models involves a large number of parameters.There are M 2 Z decay rate parameters and M (M − 1)Z baseline conversion rate parameters which makes posterior sampling a computationally challenging task.We have developed a screening procedure that operates on the data involved in the likelihood and eliminates parameters prior to posterior sampling.
In the Mβ model, the decay rate parameters β and conversion rate parameters γ capture the duration and magnitude of the excitation effects between all pairs of event types.However, it is reasonable to assume that the matrices β and γ are sparse, because the excitation effects between all event pairs are not equally significant.To be precise, we expect most elements of the β matrix to be infinite, meaning the corresponding excitations decay almost instantaneously.For the γ matrix, we expect most its values to be zero, meaning the corresponding event conversions have probability zero.For example, a successful Pass event by one team cannot significantly excite a Pass event for the opposite team, as this would make the commonplace occurrence of a string of passes by a single team very unlikely.If we are able to identify the most significant pairs of event interactions, we can thereby limit the number of elements within the matrices β and γ that we need to estimate.

S2.1 Association rule learning
Association rule learning is a method for discovering strong relationships between variables in large databases (see, for example, Agrawal et al., 1993).For example, the association rule Bread ⇒ Butter identified from a supermarket sales database would indicate that if a customer buys bread, they are also likely to buy butter.The objective of association rule learning is to identify rules that are interesting based on some measure of significance.

S2.2 Definition for event sequences
Inspired by the original definition in Agrawal et al. (1993, Section 2), we define the problem of association rule learning in the context of event sequences as Definition S2.1.
• Let A = {1, . . ., M } be the set of M distinct event types.
• Let B = {b sn }, where b sn ∈ A for s ∈ {1, . . ., S} and n ∈ {1, . . ., N s }, be the training data consisting of S event sequences with N s number of observed events in the sequence s.
• Construct a database of subsequences D = {d 1 , . . ., d C }, where C = S 1 N s , such that each event b in B has a corresponding subsequence of length W + 1 in D, made up of b and the W events preceding b.

Figure 2 :
Figure 2: Heat maps showing the density of ball-touches for Arsenal and Chelsea in their home and away games in the 2013/14 season.In all heat maps the team is attacking to the right.

Figure 4 :
Figure4: The K function estimate K(t) − 2t of the observed event times (black points) from the first game of the season between Aston Villa and Arsenal.Hawkes I (green) is a Hawkes process with parameters (µ, , β) = (0.4183, 0.0035, 0.0004) estimated from the observed times using maximum likelihood.A Hawkes process with = 0 is the trivial case with no excitation that reduces to a Poisson process (orange) with estimated rate µ = 0.4189.Hawkes II (pink) has parameters (µ, , β) = (0.2594, 0.4, 0.01) and Hawkes III (purple) has parameters (µ, , β) = (0.1068, 0.8, 0.01).Hawkes II and Hawkes III are examples of processes with moderate and severe clustering respectively, whose µ parameters were estimated from the observed times using maximum likelihood after fixing , β.The box plots of the estimates for the Hawkes and Poisson processes were computed using 100 independent simulations of each process over the same time interval as the observed times.
) models the locations as a multinomial distribution given the current state of the Markov chain.The conjugate prior for the multinomial distribution is the Dirichlet distribution (see, for example,Gelman et al., 2013, Section 3.4) and therefore we assign a Dirichlet prior on the multinomial probabilities η with a common concentration rate parameter ν.The background mark probability vector δ in the Sβ, Vβ, Mβ and MβA models is also assigned a Dirichlet prior with concentration hyper-parameter δ .The location-specific mark probability vectors (δ 1|1 , . . ., δ M |1 ) , . . ., (δ 1|Z , . . ., δ M |Z ) in the Mβ and MβA models are assigned independent Dirichlet priors with concentration hyper-parameter δ .The excitation factor α is assigned a normal prior with mean 0 and standard deviation σ α .The decay rate parameter β in the Sβ model, the parameters β 1 , . . ., β M in the Vβ model, and their locationspecific counterparts in the Mβ and MβA models are assigned independent exponential priors with a common rate β .The parameters φ m→m |z and ω cm (m, m = 1, . . ., M ; z = 1, . . ., Z; c = 1, . . ., C) in the MβA model are assigned independent Normal priors with mean 0 and standard error σ γ .The conversion rate parameters (γ m→1 , . . ., γ m→M ) in the Sβ and Vβ models, and their location-specific counterparts in the Mβ model are assigned independent Dirichlet priors with a common concentration rate parameter γ .

Figure 6 :
Figure 6: Visualising the impact of prior specifications by overlaying the posterior and prior densities for selected model parameters for the MβA model with (W, N ) = (5, 100).
we fit another MβA model after constraining all the corresponding home and away background mark probabilities to be equal, for example, δ Home Pass S|1 = δ Away Pass S|3 , δ Home Foul|3 = δ Away Foul|1 , δ Home Dribble|2 = δ Away Dribble|2 and so on.The constrained MβA model has 45 fewer parameters to be estimated as compared to the full MβA model.
Figure 7: Posterior distribution of φ Home Win→Home Pass S|2 + ω c,Home Pass Sφ Away Win→Away Pass S|2 -ω c,Away Pass S in (a) and φ Home Pass S→Home Pass S|2 + ω c,Home Pass Sφ Away Pass S→Away Pass S|2 -ω c,Away Pass S in (b), from the baseline logit specification for incorporating team abilities in (13).Interpreted as the relative ability of a team to convert a Win to a Successful Pass when playing at home compared to away in (a) and similarly from one Successful Pass to another Successful Pass in (b).Teams are ranked in the decreasing order of the means of their respective posterior distributions shown by the overlaid vertical lines.

Figure 8 :
Figure 8: Posterior distribution of the parameters ω c,Home Pass S in (a) and ω c,Away Pass S in (b), from the baseline logit specification for incorporating team abilities in expression (13).Teams are ranked in the decreasing order of the means of their respective posterior distributions shown by the overlaid vertical lines.

Figure 9 :
Figure 9: (a) Posterior distribution of ω c,Home Shot + ω c,Away Shot , the cumulative ability of a team c, relative to West Ham (baseline), to attempt a shot on goal.(b) The number of shots, passes completed in the attacking third and shots per pass completed in the attacking third (S/P) for each team in the training data.Table 10: (a) Team rankings based on the cumulative ability to trigger a particular event type.For example, the column Pass ranks teams in the decreasing order of their posterior means of ω c,Home Pass S +ω c,Away Pass S .The teams are ordered in (a) by the rankings based on their passing ability, which is a good indicator of the final position in the league table of the 2013/14 season in (b).

Figure
Figure S1: Chain-wise trace plots for some of the parameters of the MβA model with W = 5 and N = 100.

Table 2 :
Frequencies of the 22 distinct types of touch-ball events in the data.

Table 3 :
Composite event types along with their labels and observed frequencies in the data.

Table 4 :
Snapshot of the final data prepared for modelling where each event, indexed by i = 1, ..., n, consists of the following components, the time of occurrence t i , the zone z i , and the mark m i .The home and away team information for each game is assumed to be known and the first event (t 1 , z 1 , m 1 ) in each game period is considered to be deterministic and therefore, not modelled.
iid period team id time (t i ) zone (z i ) mark (m i )

Table 5 :
Zone-wise frequencies for each event type in the training data.

Table 6 :
Posterior summaries and convergence diagnostics from 2000 posterior samples for selected parameters from the MβA model after screening with (W, N ) = (5, 100).
Table 7 also provides the number of parameters in each model as a

Table 8 :
Posterior means and standard deviations (in parenthesis) of the zone dependent background mark probabilities δ m|z for z ∈ {1, 2, 3} from the MβA model.The dots (•) denote means and standard deviations less than 0.005.
m sj →m si |z si (t si −t sj ) γ m sj →m si |z si δm si + t sk <t si e α−β m sk →m si |z si (t si −t sk ) γ m sk →m si |z si for t sj < t si 0 for t sj ≥ t si .
Posterior means of branching structure probabilities for events in the first 4 minutes of the game between Chelsea and Hull City on 18/08/2013.The highlighted event Home Shot has a higher probability of being an offspring of the event Home Out Corner than being an offspring of the more recent Home Pass S event.Forecasting the probability of observing at least one Home Shot event in 30-second intervals during the game between Arsenal and Tottenham Hotspur (01/09/2013) in the test data.Intervals with observed Home Shot events are highlighted using dotted lines.MA 10 is a 10-step moving average model used as a benchmark for comparison.

Table S4 :
Transition counts c i→j from the first 5 states to the first 5 marks in the training data.We abbreviate the prefix Home to H in the mark labels.

Table S5 :
Posterior means of the multinomial parameters θ i→j corresponding to the first 5 states.We abbreviate the prefix Home to H in the mark labels.