Measuring diachronic sense change: new models and Monte Carlo methods for Bayesian inference

In a bag-of-words model, the senses of a word with multiple meanings, e.g."bank"(used either in a river-bank or an institution sense), are represented as probability distributions over context words, and sense prevalence is represented as a probability distribution over senses. Both of these may change with time. Modelling and measuring this kind of sense change is challenging due to the typically high-dimensional parameter space and sparse datasets. A recently published corpus of ancient Greek texts contains expert-annotated sense labels for selected target words. Automatic sense-annotation for the word"kosmos"(meaning decoration, order or world) has been used as a test case in recent work with related generative models and Monte Carlo methods. We adapt an existing generative sense change model to develop a simpler model for the main effects of sense and time, and give MCMC methods for Bayesian inference on all these models that are more efficient than existing methods. We carry out automatic sense-annotation of snippets containing"kosmos"using our model, and measure the time-evolution of its three senses and their prevalence. As far as we are aware, ours is the first analysis of this data, within the class of generative models we consider, that quantifies uncertainty and returns credible sets for evolving sense prevalence in good agreement with those given by expert annotation.


Introduction
As a natural language evolves, the meanings of words within the language change.The field of diachronic lexical semantics is concerned with how word meanings change over time.Words with multiple meanings or senses, and their time-evolution, are of considerable interest within the field.Examples of such words include "mouse" (meaning a rodent or a computer pointing device) and "bank" (meaning a river-bank or a financial institution).Statistical models of diachronic sense change are useful for lexicographic and linguistic research as well as for downstream applications in many natural language processing (NLP) tasks such as information retrieval.
For a word with multiple senses, the intended sense is usually apparent from the context.For example, the different senses of "bank" are obvious in the text snippets "deposited £500 in his bank account" and "plants growing on the bank of the Indus river".We expect certain context words to be used more often than others depending on the intended sense of the target word.In the "bank" example for instance, context words such as "money" or "account" are more likely when "bank" is used in the financial institution sense, whereas context words such as "river" or "stream" are more likely in the river-bank sense.We may also expect changes in the relative frequency of context words over time.E.g. for the financial institution sense of "bank", the context word "specie" (coin) is more likely to be used up to the early 20 th century whereas the context word "card" is more likely to be used later on.The prevalence of a sense itself may change over time, e.g. the pointing device sense of "mouse" increased in prevalence over the later half of the 20 th century.We are interested in a model that captures all of these features.
In this paper we analyse (as a test case) the sense change for the ancient Greek word "kosmos", meaning decoration, order or world, and quantify the uncertainty in these sense change estimates.Obtaining usefully narrow credible intervals with good coverage is no easy task and, viewed from a statistical perspective, this is our main contribution to the field.We achieve this by careful statistical modelling of the data.We develop a new model of Diachronic Sense Change (DiSC) by adapting the Sense ChANge (SCAN) and Genre-Aware Semantic Change (GASC) models of Frermann and Lapata (2016) and Perrone et al. (2019) respectively.Under this modelling framework, target word senses are represented as probability distributions over context words, sense prevalence is represented as a probability distribution over senses, and both sense and sense prevalence have temporal dependence.Our DiSC model has significantly fewer parameters than SCAN/GASC.However, there is no evidence for any loss of goodnessof-fit, and our model is significantly easier to analyse.Both aspects are important when the word of interest occurs infrequently in a very large fixed corpus of surrounding text as is the case for "kosmos" in the ancient Greek data.We found that we could only give a reliable and well-calibrated fit for SCAN/GASC when the data were strongly informative of the parameters.We could not find any variant of MCMC that could reliably fit SCAN/GASC to the "kosmos" data even for point estimation of parameters, let alone uncertainty quantification.On the other hand, we were able to fit DiSC to the data, quantify uncertainty, and get a well-calibrated match to the expert sense-annotation.Moreover, in experiments on synthetic data generated according to the SCAN model itself, DiSC scores at least as well as SCAN on sense-labelling, as measured by Brier scores.We attribute this to a favourable bias-variance tradeoff.
These models are related to topic models, though there is no direct correspondence.Variational methods are widely used to fit topic models and can be efficient for inferring posterior means.However, variational methods are less reliable for quantifying uncertainty, and in particular tend to underestimate variance.Markov Chain Monte Carlo (MCMC) methods are particularly challenging for the models under discussion.However, where MCMC methods can be shown to converge, they at least give asymptotically exact posterior summaries.We present a relatively efficient method for fitting these models in such cases, where we marginalise the posterior by summing over the discrete sense labels, and use gradient-based MCMC schemes to target the remaining continuous distributions.We use the occurrences of "bank" in an English text corpus, as a simple illustrative example where all analyses are possible, to compare our sampler against existing MCMC samplers for these models.We compare the models' predictive performance on held-out sense labels for "bank" and for synthetic datasets.We then analyse "kosmos" using our new model and MCMC sampler.
The rest of this paper is structured as follows.In Section 2 we review the existing approaches for modelling diachronic semantic change and parameter inference within the NLP literature.In Section 3 we describe the "bank" and "kosmos" datasets, respectively from the English and ancient Greek corpora.In Section 4 we present our new DiSC model and highlight the differences with the existing SCAN and GASC models.In Section 5 we describe the existing MCMC samplers targeting the posterior, and present our MCMC scheme.Section 6 compares the performance of the MCMC samplers on the SCAN model for "bank".Sections 7-9 look at the application of the models to the English, synthetic and ancient Greek datasets.Section 10 concludes with a discussion of limitations and possible future research.Some further technical details and results are given in the appendices, and R scripts for data extraction and model-fitting are uploaded to GitHub.

Related work
The problem of modelling diachronic semantic change has been approached in several different ways within the field of NLP, and detailed overviews of the literature are given by Tahmasebi et al. (2018) and Tang (2018).Broadly, the approaches can be categorised into three groups: topic-based models, graph-based models and embedding models.
The topic model first introduced by Blei et al. ( 2003) is a generative model called Latent Dirichlet Allocation (LDA).Under LDA, a document of given length is generated by sampling a topic, and then a word given the topic, at each position in the document.LDA is a simple bag-of-words model that captures some basic ideas of meaning via the word-topic associations.The model uses Dirichlet priors for probability distributions over topics and words, and variational inference is commonly used for parameter estimation.Griffiths and Steyvers (2004) give a collapsed Gibbs sampler targeting the marginal posterior integrated over the continuous parameters, which can be used for asymptotically exact inference.
LDA was extended by Blei and Lafferty (2006) to give a dynamic topic model which additionally captures the time-evolution of topics.The dynamic topic model uses logistic normal priors since these are straightforward to model as a time series, as opposed to the parameters of the Dirichlet priors under LDA.Variational inference is used for the parameters, but alternative sampling methods have been proposed in the literature.These include the blocked Gibbs sampler based on auxiliary uniform variables given by Mimno et al. (2008) and a strategy based on auxiliary Polya-Gamma (Polson et al., 2012) variables given by Chen et al. (2013).
The dynamic topic model was adapted by Frermann and Lapata (2016) to explicitly capture the meanings or senses of a given target word (as opposed to topics in documents) and their time-evolution in the SCAN model.The main distinction between the two is that a topic model has an independent topic underlying each context word, whereas in SCAN all context words for a single usage of the target word share the same sense.The logistic normal priors, defined separately in each time period, are connected to their temporal neighbours via an intrinsic Gaussian Markov Random Field (Rue and Held, 2005), which enables modelling the change in adjacent parameters without requiring a global mean.GASC, an extension to SCAN given by Perrone et al. (2019), additionally allows the prevalence of each sense to vary according to the genre of the text in which the target word is used.Both authors use the Gibbs sampler of Mimno et al. (2008) for inferring the model parameters.
A distinct graph-based approach to this problem is given by Mitra et al. (2014Mitra et al. ( , 2015) ) who use a semantic network model in which words are represented as nodes, and edges between nodes denote word co-occurrence in a sentence.The senses of a word are clustered separately for two different time periods, and then compared across the time periods to identify sense changes as well as sense births, deaths, mergers and splits.A similar approach is used by Tahmasebi and Risse (2017) who cluster senses separately for each time period and track the clusters through time.
Word embeddings are techniques for mapping words onto low-dimensional real vector spaces.In recent years, the neural-network-based Word2vec models developed at Google by Mikolov et al. (2013) have emerged as the most popular word embedding models, although Stanford's GloVe model developed by Pennington et al. (2014) is a popular alternative based on a factorisation of the global co-occurrence matrix.Skip-gram is the more popular of the two Word2vec models (the other being CBOW) and has been used to capture many semantic word relationships (Mikolov et al., 2013a) and linguistic regularities (Mikolov et al., 2013b), but the original model only uses one vector representation for each word and hence does not allow for multiple senses.The original Skip-gram has been extended in several ways to capture multiple senses per word, e.g. the Adaptive Skip-gram model given by Bartunov et al. (2016) and the loss driven multi-sense identification (LDMI) model given by Manchanda and Karypis (2019).A comprehensive review of embedding techniques used for word sense representation is given by Camacho-Collados and Pilehvar (2018).
Models based on word embeddings have been used to track semantic changes over time.These models usually construct the embeddings separately in each time period and then align the vectors across time, e.g. as done by Hamilton et al. (2016) and Kulkarni et al. (2015).Alternative approaches are given by Dubossarsky et al. (2019) who use temporal referencing instead of alignment, and by Rudolph and Blei (2018) who use dynamic embeddings (based on exponential family embeddings (Rudolph et al., 2016)) where word embeddings are set in a probabilistic framework.A different dynamic embedding model (called dynamic Skip-gram) is given by Bamler and Mandt (2017) who use a Kalman filter prior to connect embeddings across time periods.
It is not straightforward to ascertain which of the three approaches, if any, is the best.Whilst word embedding models appear to be the most popular category for semantic modelling (Kutuzov et al., 2018), these models tend to admit either multiple word senses or multiple time periods but not both simultaneously.To the best of our knowledge, there is currently no embedding model that allows multiple word senses to be modelled consistently across time.The dynamic embedded topic model (Dieng et al., 2019) tracks document topics, but not word senses, across time using word embeddings; so although it is not in substance a model for sense change, it is a step in this direction.Moreover, the embedding and graph-based models are not stochastic-process-based generative models, and this limits their interpretability and Bayesian measures of uncertainty.In contrast, the topic-based SCAN and GASC models are generative, admitting both multiple word senses and multiple time periods, and the model parameters have simple physical interpretations.The main drawback of SCAN and GASC is that they over-parameterise when the interaction between sense and time is weak or weakly evidenced by the data.They are particularly difficult to fit on sparse and noisy data, which are common, and where the parameterisation leads to ridge structures and multi-modality in the posterior.Our model, with an additive effect of sense and time, offers a lower-dimensional alternative.Quantification of uncertainty in sense change estimates is rare in the field of semantic change detection, and indeed has not been attempted by the authors of SCAN and GASC whose work we build upon.The participating models in the recent SemEval shared-task on semantic change detection (Schlechtweg et al., 2020) give a snapshot of current practice.Few models, if any, attempted to quantify uncertainty.It was not a SemEval assessment criterion, as is typical in this literature.

Data
Consider the three text snippets containing the word "bank" in Table 1, where "bank" is used in the sense of a river-bank in the first example, and in the sense of a financial institution in the other two.The first two snippets, written in the time period 1990-2010, are taken from the non-fiction genre, whereas the third snippet, written in the time period 1830-1850, is taken from the magazine genre.The snippets are of equal length with 7 words on either side of "bank".The words in blue are stopwords, i.e. the most common words in the language, which generally do not contribute to the meaning of the target word if we ignore syntax.The words in orange on the other hand appear with a low frequency (called "hapaxes" if they appear exactly once), and are uninformative in the context of the models we consider.
For a given target word, we define the data W as a collection of D snippets with a symmetric context window of L/2 words on either side of the target word (ignoring sentence and paragraph boundaries), with the stopwords, uninformative words and punctuation removed, and the words lemmatised (i.e.replaced with root words such as "wash" instead of "washing" in the first example).The snippets span multiple discrete and contiguous time periods, and may be taken from any number of text genres.Our model, described in the next section, could be applied to any dataset with these features.We analyse two real datasets in this paper: the context data for target word "bank" extracted from the Corpus of Historical American English (COHA) published by Davies (2010), which is used as an illustrative example, and the context data for target word "kosmos" extracted from the Diorisis Ancient Greek Corpus published by Vatri and McGillivray (2018), which is the focus of this work.We additionally use synthetic data to compare the models' predictive performance on held-out true sense labels.The "bank" example was used by Frermann and Lapata (2016), and we use it as an illustrative example due to its relative simplicity since the two main senses of "bank" are very distinct.There are c.74,000 instances of "bank" in COHA covering the years 1810-2009, which we divide into ten 20-year contiguous blocks, and spanning four genres (fiction, non-fiction, news and magazine).We extract snippets of length L = 14 words (not counting the target word) around these instances.The snippet length has to strike a balance between including meaningful nearby words and not including noise from distant words, and we found the length L = 14 to be sufficient for a human to identify the sense in most cases.For computational ease, we randomly subsample a maximum 100 snippets per genre-time block, giving us 3,685 selected snippets and c. 70,000 nonselected snippets.We manually tag 3,525 of the selected snippets with the correct sense of "bank", grouping together the related meanings of river-bank, edge, tilt or heap, and the related meanings of a financial institution or a store (e.g.blood bank).The remaining snippets were either ambiguous or used "bank" as a proper noun (e.g.Mr Banks), or a very small number of other senses.We identify and remove stopwords using the R package Stopwords (Benoit et al., 2020) as well as part-of-speech tags, marking anything other than nouns, adjectives, verbs and adverbs as stopwords.We further restrict the data to the top 70% most frequently occurring words in the non-selected snippets.This is done for efficiency reasons, and seems to incur little loss of information.We refer to the 30% of words omitted as "uninformative" words: they occur infrequently in the target context.The observation model for context words in the selected snippets is not affected by this registration, as uninformative words are defined by their frequency in non-selected snippets.Using this registration criterion, we have a vocabulary of 973 words.The "kosmos" example was used by Perrone et al. (2019), and we use it as our test case because, in contrast to "bank", we found it particularly challenging to analyse using existing models and tools."Kosmos" can be used in one of three senses, i.e. decoration, order or world, and expert sense-annotation is provided by Vatri et al. (2019).We use snippets of length L = 14 as before and, following Perrone et al. (2019) for testing purposes, retain only the "collocates", i.e. all snippet instances where an ancient Greek expert was able to identify the sense based on contextual information alone.A "realuse" analysis including non-collocates is given Appendix F. We group the text genres into "narrative" and "non-narrative", and divide time into 9 contiguous centuries from 700 BC to AD 200.We remove stopwords in the same way as for "bank" plus the additional stopwords identified by Berra (2018).Hapaxes (i.e.words that appear only once in the selected snippets) are removed, approximating the observation model as we do not condition on the event that a context word appears at least twice over all selected snippets.This is in contrast to the registration process for "bank" where we had a large set of non-selected snippets.This leaves us with 1,144 snippets and a vocabulary of 968 words appearing in snippets associated with "kosmos".Tables 2-3 show that the "kosmos" data is a lot more sparse and fragmented than the "bank" data, especially for the early time periods.

Prior and observation models
In this section we introduce our new DiSC model and highlight how it improves upon the existing GASC model, noting that GASC is the same as SCAN except that it allows the sense prevalence to vary according to the text genre in addition to time.DiSC, given in Algorithm 1, is a generative model of how the context words in the snippets around a given target word are emitted from a latent stochastic process.We make the simplifying assumption commonly used in NLP that each snippet is a "bag-of-words" (i.e.word order and grammar are ignored) of length L, not counting the target word itself.Stopwords and uninformative words are generated in any snippet d with probabilities q SW and q U respectively, so the number of context words that are neither stopwords nor uninformative, denoted L d , has a binomial distribution L d |L, q SW , q U ∼ Bin(L, 1−q SW − q U ).The context positions occupied by these words are a random subset {i 1 , . . ., i Ld } of size L d drawn from {1, . . ., L}, i.e. the order of words is irrelevant as per the bag-ofwords assumption.Once we condition on the L d values, the likelihood (which we write down in Section 5 below) does not depend on q SW and q U .We can therefore drop q SW and q U from all analysis hereafter.
The snippets span T discrete and contiguous time periods and G text genres, so we have deterministic mappings between each snippet d and its time and genre labels τ d ∈ {1, . . ., T } and γ d ∈ {1, . . ., G} respectively.The target word can be used in one of K senses in any snippet d, so the single sense assignment z d for the whole snippet is emitted as a draw from a multinomial distribution over the senses {1, . . ., K} parameterised by the K-dimensional probability vector φγd,τd , that is z d | φγd,τd ∼ Mult φγd,τd 1 , . . ., φγd,τd K .The vocabulary consists of V words so, given the sense assignment z d , for each context position i ∈ {i 1 , . . ., i Ld } the context word w d,i is emitted as a draw from a multinomial distribution over the words {1, . . ., V } parameterised by the V -dimensional sensedependent probability vector ψzd,τd , that is w The probability vectors φg,t and ψk,t are softmax transforms of the real-valued sense prevalence parameter vector φ g,t and word parameter vector ψ k,t respectively, i.e. φg,t = exp(φ g,t ) . (1) Here φ, φ are K × G × T dimensional arrays and ψ, ψ are V × K × T dimensional arrays.
The latent variables φ and ψ are not identifiable in this setup, but this is not an issue since these are only used to determine the priors for the probability arrays φ and ψ which are the variables of interest.The word parameter vector ψ k,t , which depends on both the sense k and the time t, is defined as the sum of a word-sense parameter vector χ k and a word-time parameter vector θ t .We place independent autoregressive AR(1) priors on the elements of φ g,t and θ t , and independent normal priors on the elements of χ k , as defined in Algorithm 1 lines 2-15.The prior hyperparameters are kept fixed.
The GASC generative model is given in Algorithm 2 in Appendix A. Our DiSC model differs from GASC in three main ways.The fundamental difference is that we assume the effects due to sense and time are additive: the 3-dimensional V ×K ×T array ψ in GASC is replaced by two 2-dimensional arrays in DiSC, i.e. a V × K array χ and a V × T array θ, so that we have ψ k,t = χ k + θ t for k ∈ {1, . . ., K} and t ∈ {1, . . ., T }.This reduces the dimension of ψ from V KT parameters in GASC to V (K + T ) parameters in DiSC.The implications of this for sense change measurements are discussed in Section 4.1.
The second difference is in the modelling of the time series variables φ g,t , θ t in DiSC and φ g,t , ψ k,t in GASC.Our priors on the time series are autoregressive AR(1) processes with proper stationary distributions, whereas GASC uses improper priors without global means or stationary distributions.For example, the prior distribution of . ., T }, and an improper uniform distribution over all real numbers for t = 1.It is thus possible for the posterior φ g,t k |W to drift off to ±∞ since there is no global mean to tether the distribution.In contrast, the DiSC priors have a stationary distribution φ g,t k |κ φ , α φ ∼ N 0, κφ 1−(αφ) 2 for all t.We expect this to lead to more homogeneous behaviour at the beginning and end of the time series.
draw word-sense parameter χ k |κ χ ∼ N (0, diag(κ χ )) 15: end for 16: for sense k ∈ 1 : K and time t ∈ 1 : T do 17: set word parameter ψ k,t = χ k + θ t 18: end for 19: using softmax (1), transform real arrays φ and ψ into probability arrays φ and ψ --------OBSERVATION MODEL --------20: fix probabilities of drawing stopwords q SW and uninformative words q U 21: for snippet d ∈ 1 : D do 22: draw number of context words L d |L, q SW , q U ∼ Bin(L, 1 − q SW − q U ) 23: draw a random subset {i 1 , . . ., i Ld } of size L d from {1, . . ., L} The third difference is that whilst we treat the prior hyperparameter κ φ as fixed, GASC uses a random hyperprior κ φ ∼ Inv Gamma(a, b) with a, b fixed.However, the data do not inform this parameter at all well: κ φ is conditionally independent of the data W given φ, we have a relatively small number of φ parameters, and these are in turn conditionally independent of the data W given the unknown sense labels z.In contrast, physical considerations lead to an informative prior for κ φ .The joint posterior (2) below is insensitive to the choice of κ φ for any plausible value of this parameter, so we can fix κ φ without loss using the prior elicitation described in Section 4.2.

Sense representation and sense change
Following the framework of Frermann and Lapata (2016) for these bag-of-words models, we define sense k of the target word at time t as the distribution ψk,t 1 , . . ., ψk,t V over words {1, . . ., V } appearing in the context of the target word."Sense change" may therefore be defined as the evolution of the V × K matrix ψ•,t over t ∈ {1, . . ., T }.Similarly, "sense prevalence change" may be defined as the evolution of the K ×G matrix φ•,t over t ∈ {1, . . ., T }.We loosely refer to both evolutions as "sense change" for brevity.This definition of sense change does not correspond to any sort of real meaning change (which we do not define, but we have in mind something like changes in the dictionary definition of the target word), which is a limitation of all bag-of-words models.
We parameterise DiSC to model key structural drivers of sense difference (over senses) and sense change (over time).The additive main effects in DiSC (χ and θ respectively) correspond to these drivers.This allows us to model the main effects, and the data to inform them, in a direct way.In contrast, GASC models the effects and their interaction in a general way without an explicit parameterisation or modelling of the main effects.Useful structural information is left out of the prior, as the model imposes no core additive structure.We describe below the main data features that we see as informing sense difference and sense change.
The probability ψk,t v of context word v ∈ {1, . . ., V } being used in a snippet can change when the background usage frequency of word v changes in the text corpus taken as a whole.For example, the background usage frequency of the word "telephone" likely increases over the 20 th century.It is then more likely to appear in any context, across all target word senses.This is, in our definition, a form of sense change.We expect this time-effect, captured by θ t in DiSC, to be a common driver of sense change in these models.The time-effect is not explicitly modelled in GASC.
Similarly, we expect certain words to appear more frequently in the context of particular senses of the target word regardless of their background usage frequency.Examples include "water" and "money" for the two senses of "bank" respectively.This relatively strong sense-effect informs sense difference for the target word, and is basic to the bagof-words setup.DiSC, in contrast to GASC, models this sense-effect explicitly through χ k .
On the other hand, we might expect certain words to change in frequency at different rates in the context of different senses of the target word.Such changes might be driven by actual changes in the meaning of the target word.For example, "telephone" might increase in frequency more rapidly in the financial institution sense of "bank", as the sense changes to reflect more modern ways of banking, than it does in the riverbank sense.This sense-time interaction effect is captured by GASC but not by DiSC.However, GASC does not distinguish between the main and interaction effects, so if we are interested in isolating this behaviour then it is necessary to remove the main effects.We illustrate this in our analysis of the "bank" data in Section 7 below.
Goodness-of-fit is generally retained in DiSC even in the presence of a real interaction.Words contributing to poor fit due to missed interactions are words that both appear frequently in the context of multiple target word senses and evolve in frequency at different rates, since these have both large percentage change and large value.Large percentage errors in context probability for very small context probabilities are of little consequence.This feature of the data makes DiSC relatively robust to interaction in practice, as demonstrated in the synthetic data experiments in Section 8 below.
The additive structure in DiSC helpfully prevents switching of sense labels between time periods.In MCMC targeting GASC, the senses can settle into different sense-label permutations in different time periods, as ψ-values are only loosely connected temporally via their priors.This is a particular problem when some context words appear frequently across more than one sense.In such cases, the MCMC may get stuck in a local mode where the sense labels are not aligned across time.This is no criticism of GASC, but adds to the challenge of fitting it to data.
One basic and important kind of sense change, i.e. the emergence of a new sense in addition to the existing senses, is captured in both models if K is large enough to accommodate the new sense.The new sense k exists at all times, but its prevalence φg,t k changes from being very small to significant as it emerges with increasing t.

Hyperparameter settings
We choose the number of senses K equal to the smallest value such that each sense k ∈ {1, . . ., K} in the model output, as identified by the most probable words under that sense, is distinct and meaningful in the judgement of the expert user.This is practical when DiSC is used as an exploratory tool to discover the lifespan and usage rate of distinct senses without the need for hand-labelling.Alternatively, the expert user may have hand-labelled a small set of snippets, in which case we would have prior knowledge of K. Setting K therefore requires a few trial runs.In our case, K = 2 for "bank" and K = 3 or 4 for "kosmos" depending on the task (see Section 9 and Appendix F).If the value of K is still in question, model selection tools may be used, for example classic Bayesian tools such as Bayes factors (Kass and Raftery, 1995) or reversible jump MCMC (Green, 1995), or rather the state-of-the-art on these such as Xing (2021) or Karagiannis and Andrieu (2013) respectively.However, we have not investigated this.
Implementing SCAN/GASC requires setting the hyperparameter ), κ ψ and the hyperparameters a, b in κ φ ∼ Inv Gamma(a, b).Frermann and Lapata (2016) used the setting κ ψ = 0.1, a = 7 and b = 3 for SCAN whereas Perrone et al. (2019) used κ ψ = 0.01, a = 1 and b = 1 for GASC.We report results for the SCAN choice as these give the best performance for these models.
For the AR(1) processes in DiSC, we use the parameters α φ = α θ = 0.9 -a high value -in order to have weak mean reversion, so that we have a proper process prior without unduly influencing the posteriors.
To set κ φ in DiSC, we elicit a prior by defining what we consider to be an extreme sense prevalence difference.The number of senses K is relatively small (compared to the vocabulary size V ).Taking two fixed senses l, m ∈ {1, . . ., K}, we allow a difference as large as φg,t l / φg,t m ≈ 100 to be possible but extreme.This can easily be adjusted depending on the data-modelling context.On the logit-scale, we therefore assert that φ g,t l − φ g,t m > log 100 is a 3-sigma event.From the prior stationary distribution in the AR(1) process, V(φ g,t l − φ g,t m ) = 2κφ 1−(αφ) 2 ; so we express our preference with 3 2κφ 1−(αφ) 2 1 2 = log 100, giving κ φ = 0.25 on rounding.There is no simple comparison with κ φ in SCAN/GASC due to the AR(1) structure in DiSC.If we replace the threshold for a rare event with φg,t l / φg,t m ≈ 10 6 then we find κ φ ≈ 2: the standard deviation changes over a small range from 0.5 to √ 2. Larger values are hard to justify.The vocabulary size V is typically quite large (c.1,000 in our examples), so we might expect greater variation in the probabilities for context words in a given sense, perhaps by a factor of of 1,000.That is, for any fixed time t, sense k and pair of words x, y ∈ {1, . . ., V }, the ratio of context word probabilities might be as large as ψk,t x / ψk,t y ≈ 1 000.Now , so we express our preference with 3 2κ χ + 2κθ 1−(αθ) 2 1 2 = log 1 000.Attributing the variance in equal parts to χ and θ by setting κ χ = κθ 1−(αθ) 2 , since they are additive effects on the same scale, we have κ χ = 1.25 and κ θ = 0.25 on rounding.As for κ φ , this is robust to the choice of threshold due to the logarithm, so the posterior is relatively insensitive over values plausible a priori.Moreover, we have considered two fixed context words, not the most extreme pair, so more extreme variation is allowed.

Posterior distribution and MCMC inference
For convenience, we infer posterior distributions for the sense prevalence parameters φ, the word-time parameters θ and the word-sense parameters χ, but our interest is in the identifiable probability arrays φ and ψ which are given as deterministic functions of these latent variables.We expect the sense assignment vector z to be of less interest, except in testing where we have hand-annotated meanings.The joint posterior for φ, θ, χ, z given the data W is defined by where N z k,g,t = which is a multinomial distribution over possible senses {1, . . ., K}.
The authors of SCAN and GASC both use a blocked Gibbs strategy to alternately sample z|W, φ, ψ using (5) and φ|z and ψ|z, W .Under the SCAN and GASC models, each column of φ and ψ has a logistic normal distribution which the authors target with a Gibbs sampler based on the auxiliary uniform variable method of Mimno et al. (2008).An alternative Gibbs sampler is based on auxiliary Polya-Gamma variables (Polson et al., 2012) and an approximate method based on the same is given by Chen et al. (2013).We describe these methods in Appendix B.
Under our DiSC model, since ψk,t does not have a logistic normal distribution, the latter two methods cannot be used in a straightforward manner, although the auxiliary uniform variable method can be easily adapted.Moreover, both the auxiliary uniform and Polya-Gamma variable methods are very inefficient for these models, whereas the approximate method is obviously not asymptotically exact.We describe our MCMC sampler, which is asymptotically exact and at least as efficient as the existing samplers, and may be used with any of these models.
We marginalise over the discrete z, removing the need to sample z|W, φ, ψ, and alternate between sampling φ|W, θ, χ and θ|W, φ, χ and χ|W, φ, θ (or, in the case of SCAN/GASC, between φ|W, ψ and ψ|W, φ).The marginal likelihood for φ, θ, χ|W is where ( 6) exploits the conditional independence between the snippets, (7) comes from conditioning on (and summing over) the sense assignment z d for snippet d, (8) exploits the conditional independence of the context words w d,i , i ∈ {i 1 , . . ., i Ld } in snippet d, and (9) simply picks up the appropriate probabilities from the φ and ψ arrays.The conditional posteriors for φ, θ and χ are therefore and π(χ|W, φ, θ) ∝ π(χ) respectively, which we can sample efficiently using gradient-based MCMC methods such as Metropolis-Adjusted Langevin Algorithm (MALA) and Hamiltonion Monte Carlo (HMC).We describe these methods, including the derivation of the gradient vectors and automatic parameter tuning, in Appendix C. At this level, the marginal posterior distribution in GASC is the same as DiSC (the essential difference being the parameterisation of ψ and the priors) and any sampler relevant for DiSC also applies for GASC.This marginal is often tractable for LDA but not used, as the collapsed Gibbs sampler obtained by integrating over the conjugate priors of the continuous parameters is favoured there.That is not possible here as the priors are not conjugate.In addition to these sampling methods, we implemented a simple random-walk Metropolis sampler, both jointly and marginally over z.This was useful for code-checking not competitive and has been omitted.
For both datasets, we find that 20k MCMC iterations for SCAN/GASC with burn-in of 10k, and 10k MCMC iterations for DiSC with burn-in of 5k, are sufficient for convergence, where an iteration is one update over all parameters (except κ φ for SCAN/GASC, which is updated every 50 iterations).Work to date on SCAN and GASC appears to take just 1k MCMC iterations in total, which we found was not nearly enough for convergence on these data.All figures and results reported in the following sections are based on posterior means unless otherwise indicated.The posterior is invariant under sense relabelling in all models, but this behaviour was not observed in any converging run.

Experiment 1: Finding the best sampler
We take as our test case for comparing the samplers the problem of fitting the SCAN model (i.e.GASC with one genre) for the target word "bank" in COHA.We are limited in these choices since, as we report in Section 9, no MCMC sampler we tried converged for the SCAN/GASC model on the "kosmos" analysis, and the Poly-Gamma samplers cannot be used with DiSC due to the additive form for ψ.The metric we use for this comparison is the effective sample size (ESS) per hour of CPU time after the burn-in.We implemented the samplers in the R programming language, as efficiently as we could, and using the same functions as far as possible.We checked that they converged to the same posterior distributions for all variables with high precision on small synthetic datasets.All runs are done sequentially on the same Linux PC.
For φ, we are inferring KGT parameters and we take the median ESS per hour over all parameters.For ψ, we are inferring V KT parameters but our interest is mainly in the most representative words for each sense; so we take the median ESS per hour across the ψ parameters for the 20 most probable words under each sense marginally over all time periods.The medians, together with the interquartile ranges, are shown in Table 4.Even allowing for uneven coding efficiency, it is clear that the differences are substantial.
The auxiliary uniform variable method (top row) used in the past to fit SCAN and GASC is the least efficient for φ by an order of magnitude or more, and the auxiliary Polya-Gamma variable method (second row) is by far the least efficient for ψ.The approximate Polya-Gamma auxiliary variable method (third row) is efficient, but not asymptotically exact, though fairly accurate in our experiments comparing against asymptotically exact samplers.Our MALA and HMC samplers targeting marginal posteriors are asymptotically exact methods of similar or better efficiency for φ and ψ  steps) 1,613 ( 864 -1,816) 1,312 (1,054 -1,531) † Randomly chooses 1 or 2 leapfrog steps for φ proposals and 1 or 5 leapfrog steps for ψ proposals at each update respectively, and are therefore preferred.MALA is comparable to a 1-step HMC sampler, so we may combine the strengths of MALA and HMC by randomly choosing either a 1-step or a multiple-step proposal in our HMC sampler at each update.This mixed MALA-HMC sampler (last row) is clearly the most efficient for both φ and ψ.
HMC, whether straight or mixed with MALA, has one additional parameter to tune, i.e. the number of leapfrog steps.Users may therefore prefer MALA for convenience, or mix MALA with the No-U-Turn HMC sampler of Hoffman and Gelman (2014) to avoid tuning the number of steps.Trace plots in Figure 5 in Appendix C illustrate the differences in mixing rates between the samplers for a prevalence parameter (where the difference is most visible) in runs of equal time.

Experiment 2: Analysis of sense change for "bank"
We now measure the time-evolution of the prevalence and context word composition of the different senses of a target word in a simple example.Our objective is achieved if the posteriors for ψk,t for k ∈ {1, . . ., K} can be interpreted by a human as K unique target word senses, since these senses are automatically identified over time t ∈ {1, . . ., T } in the snippets.In the case of "bank", both DiSC and SCAN/GASC achieve this since the most probable words under an ordering based on the posterior expectation of timeaveraged word probabilities 1 T T t=1 ψk,t from both DiSC and SCAN/GASC display the senses of river-bank and institution bank respectively: k=1: river stream water stand bank leave tree creek time reach k=2: bank national note money deposit saving reserve credit loan issue The most probable words in each time period can easily be extracted if their evolution is of interest.
In order to assess the performance of the models, we consider their ability to recover the true sense labels o d , d ∈ {1, . . ., D} for the D = 3 525 snippets that we manually tagged (cf.Section 3).Let p(z d = k) be the estimated value of E φ,ψ|W p(z d = k|W d , φ, ψ) computed on the MCMC output.The Brier score, in our case defined by 5 in the case of K = 2 senses for "bank", so a model must produce a score lower than this in order to be useful.Using the posterior mean probabilities p(z d = k) obtained from normalising (5), we get the Brier scores shown in Table 5 from running DiSC and GASC under various genre configurations, where we fit both models using MALA.Both models therefore identify the true sense with a good level of accuracy.However, DiSC has slightly better performance in all cases.
We also examine confusion matrices from a simple classification task.We assign snippet d the sense l = arg max k∈1:K p(z d = k).Table 6 shows the key statistics from the confusion matrices produced using this decision rule (treating river-bank as the positive and institution bank as the negative condition) suggesting that the DiSC model is marginally better than SCAN despite having about half as many parameters.
Marginal Highest Posterior Density (HPD) intervals are a useful visualisation of uncertainty in the model output.In order to form a summary of the evolution of the sense prevalence, we look at the 95% HPD intervals for the marginal posteriors φg,t k |W from the DiSC and SCAN/GASC output.We compare these against independent well-informed estimates.We have the true sense labels o d , d ∈ {1, . . ., D} (not available in general) for these data.However, these are only multinomial draws with category-probabilities given by the unknown true prevalence Φ say.Previous authors have benchmarked against empirical sense probabilities N o k,g,t / K l=1 N o l,g,t computed using true sense labels.These are MLEs for the components of Φ in each sense, genre and time period.We expect our HPDs, computed on unlabelled data, to match these when the counts are high and the multinomial MLEs have small errors.However, when the labelled data counts are low (as in the "kosmos" data) or vary over a wide range from one time period to another, we should quantify the uncertainty in these independent estimates of the ground-truth prevalence Φ.
We therefore treat the true sense labels as a second dataset, and validate our HPD sets estimated on the unlabelled data against HPD intervals for φg,t by taking the true sense labels as data in DiSC.(Whilst this may appear to favour DiSC, in fact, for this part of the model, the conditional distributions of φ|z are the same in all important respects, differing only by the AR(1) smoothing in DiSC.)A high degree of overlap between the HPD intervals from the two posteriors (based on labelled and unlabelled data) indicates good model performance.Figure 1 shows that this is the case for both DiSC and SCAN (recall that SCAN is just GASC for G = 1) for both senses of "bank" for most time periods, although they start to diverge towards the end, perhaps indicating some over-smoothing across time.The performance of DiSC and SCAN is very similar.The ground-truth prevalence estimates based on MLEs (coloured bars) and labelled-data posteriors (dashed error bars) are very close for these data as the sample sizes are large (cf.Table 2).Equivalent graphs for thinner time intervals are given in Appendix D.
As discussed in Section 4.1, DiSC captures only the main sense and time effects via the additive structure in ψ whereas SCAN includes the sense-time interaction effect.As an example, Figure 2 shows the HPD intervals for ψk,t v |W for the context word v = "commercial" under both models."Commercial" appears predominantly under the institution sense of "bank", and increases in probability over time, which is reflected under both posteriors.However, whilst ψk,t v |W under SCAN for the river-bank sense remains relatively flat, in contrast, the DiSC posterior has ψk,t v |W increasing for both senses, as the contribution from θ t v |W increases and does not distinguish sense.We know from Brier scores and confusion matrices that DiSC gives better automatic senselabelling than SCAN or GASC on these criteria.The reason for this is that although SCAN is capturing the interaction here, and DiSC is not, such interactions seem to be  rare, and when present involve a context word with a relatively high context frequency ψk,t v |W in the evolving sense and low context frequencies in all other senses (as generic context words have low frequency in any given sense).Our example illustrates this.Small absolute errors in small, uninformative context frequencies are offset by more reliable estimation of larger, more informative context frequencies.This is discussed further in the next section.

Experiment 3: Model predictive performance on synthetic data
Since DiSC drops sense-time interaction (cf.Sections 4.1), it is of interest to find examples in which this causes it to fail.We therefore compare the models' predictive performance on held-out sense labels in the presence of a known sense-time interaction effect in the data.We try to make this easy for SCAN and hard for DiSC: we calculate Brier scores on synthetic datasets of varying sizes generated using the SCAN model.Using the settings T = 9, K = 3 and V ≈ 1 000 seen in "kosmos", we sample parameters φ, ψ, κ φ according to the SCAN prior model, as this generates sense-time interaction.We then sample the true sense labels o and snippets W using the SCAN/DiSC observation model for a fixed number of snippets D/T per time period, using the same stopword probability and context-word registration criterion as for "kosmos" (cf.Section 3).We choose a large D/T so that the data contain lots of sense-time interaction and also strongly inform this interaction.This setup might be expected to favour SCAN and challenge DiSC.
These random synthetic datasets do contain an abundance of sense-time interaction.Consider the following simple measure of the level Λ of interaction in the simulated for all v, k, t where Then, for each context word v ∈ {1, . . ., V }, we fit a linear model where time t is continuous (since discrete t leads to a perfect fit) and ε v is Gaussian noise.
We make an F -test for the interaction effects η l v at level 5% and measure the extent of sense-time interaction Λ as the proportion of context words {1, . . ., V } for which the interaction effects in ( 14) are significant.This is a conservative measure, since it only captures linear sense-time interactions and misses potential situations where context probability increases and then decreases (or vice versa) over time.In the synthetic data with D/T equal 100 and 500, the interaction level is Λ = 0.13 and Λ = 0.20 respectively.For comparison, using this measure, we have Λ = 0.144 and Λ = 0.070 respectively for the "bank" and "kosmos" datasets, so on this simple measure the extent of interaction in these synthetic data is representative.
Table 7 summarises the results of these experiments.Our MCMC for SCAN on the smaller D/T = 100 dataset did not converge, whereas we get good convergence on the larger D/T = 500 dataset.DiSC estimates a smaller number of parameters than SCAN, with lower variance but potential bias.This tradeoff seems to be advantageous for senselabelling: as evidenced by the Brier scores on the larger dataset, DiSC sense-labelling on SCAN-friendly synthetic data is as good or better than SCAN itself.
The Brier score is a proper scoring rule, so we expect SCAN to do better on average over synthetic datasets -on average but not necessarily in probability.Posterior plots similar to Figure 2 (not included in this paper) confirm our intuition that, despite the large proportion of context words with a significant sense-time interaction effect, very few of these words appear with high probability across more than one sense.This may explain the Brier score ordering.It is possible to construct data for which SCAN scores more highly than DiSC by artificially fixing the prior parameters to include a very large proportion of words with both high frequency in more than one sense and strong interactions (cf.Section 4.1).However, this seems unrepresentative of typical real data.
In experiments where we explicitly introduce such strong interactions across multiple senses (discussed in Appendix E), we find that the performance of SCAN is a little better than that of DiSC but the gain is slight.In our experience, modelling sense-time interactions is detrimental for automatic sense-labelling.We now come to the principal application.We will make this analysis twice: here, and in Appendix F, depending on whether we exclude (as here) or include non-collocates.
All translations of Greek words in this section have been obtained from Wiktionary.
The ancient Greek data for target word "kosmos" (κόσμος) contains considerably fewer snippets than the "bank" data (cf.Section 3) whilst using almost the same sized vocabulary, making it a relatively sparse and noisy dataset.The "kosmos" data contains other features making this analysis harder than the "bank" analysis: the three senses of "kosmos" are not as well separated as those of "bank", and a number of context words appear with high probability under more than one sense in the expert annotation o d , d ∈ {1, . . ., D}. Examples include θεός (divine/god {0.26, 0.10, 0.64}), ἀνήρ (man {0.45, 0.47, 0.08}) and πόλις (city {0.30, 0.64, 0.06}) among others where, given that word v appears as context in a snippet, it appears under sense k with empirical probability This makes the task of automatic sense-identification in the "kosmos" data particularly challenging.
Using our DiSC model with two genres, the words most probable a posteriori under each sense are as follows: 1 γυνή χρύσεος φέρω κοσμέω καλός σῶμα ὅπλον ἐσθής πλῆθος τάξις 2 πόλις πολιτεία πρότερος ἀνήρ φέρω νόμος μέγας πάρειμι ἀρχή καθίστημι 3 γῆ οὐρανός ὅλος θεός εἷς φύσις καλός σύμπας ψυχή σῶμα We compare these against ground truth.We identify the senses {1, 2, 3} with the labels {decoration, order, world} by mapping our marginal posterior distributions 1 • under the expert-annotated sense labels o d , d ∈ {1, . . ., D}.We note that certain distinctive words that help identify the sense have been correctly assigned by our model, for example: γυνή (woman) and χρύσεος (golden) for "decoration"; πολιτεία (citizenship) and νόμος (custom/law) for "order"; γῆ (earth) and οὐρανός (sky) for "world".Some words such as τάξις (literally "order" but assigned with "decoration") have been misplaced, although the error is small: if v = τάξις then v is assigned by the model to sense k with probability ψl,t v = {0.43,0.42, 0.15} for k = {1, 2, 3}.Using the Brier score to assess model performance, the baseline score under uniform random assignment is BS = 0.67 for K = 3 senses.The realised score of BS = 0.41 using DiSC therefore indicates good model performance.Ignoring genre and setting G = 1 gives BS = 0.47, i.e. worse than the G = 2 case, so the genre-covariate plays a role, in agreement with the views of Perrone et al. (2019).As in the "bank" analysis, the confusion matrix from a similar classification task for the "kosmos" data is summarised in Table 8, indicating generally good model performance.The relatively lower sensitivity for the "decoration" and "order" senses is reflective of the ambiguity discussed above.
In Figure 8 the 95% HPD intervals for the marginal posteriors φg,t k |W from the DiSC output have a high degree of overlap with the posteriors φg,t k |(z = o) based on expertannotated sense labels for all k, g, t, again indicating good model performance.The DiSC HPD intervals contain the empirical estimates N o k,g,t / K l=1 N o l,g,t for these data in most cases, with the exceptions being the time periods with little data (cf.Table 3).For these time periods, ground-truth sense prevalence estimates given by the posteriors φg,t k |(z = o) seem to us a more appropriate basis for comparison than the simple empirical estimates.Adjacent temporal data smooths these posteriors.Also, where there is limited data at both time t and times t ± 1, the increased uncertainty in the ground-truth is quantified in the wider HPD intervals for those times t.The very high degree of overlap between the marginal posteriors φg,t k |W and those for φg,t k |(z = o) indicates we are doing as well, for prevalence estimation with the unlabelled data, as we would do if we actually had the ground truth.
Using GASC, on the other hand, we find that all MCMC fail to converge to the same stationary distribution starting from different random configurations.Exploration showed that the posterior distribution for GASC contains multiple metastable states with very long lifetimes and multiple modes (besides those associated with label switching).We do not see convergence even when using our best samplers with parallel tempering on multiple cores.This seems to be associated with over-parameterisation in the GASC model for this small and noisy dataset: the ψ array has V KT parameters in GASC compared to only V (K + T ) parameters in DiSC.This is more than double the number of parameters, and leads to multiple modes of near-equal fit as measured by the loglikelihood.
The senses cannot be reliably identified from the model output as they are not sufficiently distinct.For example, our best run using GASC from six different random starting configurations gave the following most probable words: 1 ἀνήρ πολιτεία ἀρχή γυνή κύριος ἀξιόω τάξις πρότερος πάτριος κόσμος 2 φέρω γυνή πόλις καλός κοσμέω θεός μέγας τάξις πολιτεία σῶμα 3 γῆ πόλις ὅλος θεός οὐρανός εἷς φύσις σῶμα πέντε σύμπας Since representative words for both "order" and "decoration" (such as πολιτεία and γυνή) appear with high probability under two senses, it is not clear how to assign the sense labels between them in order to make any sort of comparison.The same behaviour is seen further down the sense columns for other less frequent context words.Trying all possible permutations of the sense labels, the best Brier score we get is BS = 0.73worse than randomly assigning the probability p(z d = k) = 1 K for all d, k.

Conclusion
We have given a generative model of Diachronic Sense Change (DiSC), treating sense and time as additive effects, and a gradient-based MCMC method for inferring model parameters.
Whilst we adopt the overall modelling framework of Frermann and Lapata (2016) and Perrone et al. (2019), we found that the specific MCMC samplers used there take around 40 times as long to achieve the same ESS.Our MCMC exploits the fact that it is possible to marginalise by summing over the discrete sense assignments exactly.Gradient-based MCMC on the marginal is, not surprisingly, far more efficient than simple Gibbs sampling on the joint model.
We carried out automatic sense-annotation by identifying word-sense dependence, and discovered sense groupings.We measured time-evolution of sense distributions over context words and of sense prevalence distributions over senses.We showed that, for the well-behaved "bank" data where MCMC targeting DiSC, SCAN and GASC all converge, DiSC has slightly better predictive performance for held-out expert-annotated sense labels despite being the simpler model.We further showed that, for the smaller and noisier "kosmos" data where no well-calibrated fitting procedure is available for SCAN and GASC, DiSC works well in the sense that it gives prevalence estimation intervals close to those we would obtain if we had the true sense labels.
As far as we are aware, our analysis of "kosmos" in the test case Section 9 and "realuse" Appendix F give the first analysis of sense change for these data that returns good agreement with prevalence from expert annotation.One criticism examined in Section 4.1 is that DiSC does not model the sense-time interaction effect.In our setting, the sense of a target word is defined by the distribution over its context words.DiSC does model temporal change in this distribution, but captures only the main effects associated with changes in the overall usage frequency of the context words across all senses of the target word.It would be interesting to look for further examples of target words where the independent evolution of senses modelled by SCAN/GASC is measurable and impacts sense-labelling.
Our study of synthetic data in Section 8 showed that even when data are simulated under SCAN, with abundant interaction, DiSC gives reliable sense-label predictions and parameter estimates.Furthermore, in many potentially interesting cases where the data are sparse, there is no computational procedure we know of that will actually fit the interaction as parameterised in SCAN/GASC reliably.In future work we would like to find a parameterisation of the sense-time interaction effect that is tractable.DiSC would then be a natural null model to this alternative.
All these models can identify the emergence of new senses, but we have not explored this.It may be possible to include an atom of probability at φg,t k = 0 in order to carry out simultaneous and formal model-selection for the first time t at which φg,t k > 0 and there is evidence for a new sense.
DiSC shares some disadvantages with the generative models on which it is based.We target the senses of one word at a time, in contrast to some approaches in the machine learning literature cited in Section 2. However, DiSC does provide well-calibrated uncertainty estimates, as evidenced by our experiments, rather than just point estimates.Another limitation is that the number of senses K needs to be checked in multiple runs.It may be possible to learn K using more formal testing procedures.However, for estimation of sense distributions (over context words), sense prevalence and their evolution from unlabelled text, DiSC works well in a semi-supervised mode in which model selection is based on the requirement that the output be meaningful to the user.

Implementation
R scripts and data files used to produce the results, figures and tables reported in the paper are available from https://github.com/schyanzafar/DiSC.

B.1 Auxiliary uniform variable method
The auxiliary uniform variable method of Mimno et al. (2008), which is based on the method of Groenewald and Mokgatlhe (2005), is used to sample φ g,t k |φ g,t −k , z iteratively over k ∈ 1 : K by moving within a weighted and bounded region, where the weights are determined by the prior distributions and the bounds are determined by N z k,g,t .Figure 4 shows an example.Given N z k,g,t snippets assigned sense k and N z •,g,t snippets in total for time t and genre g, we draw N z k,g,t uniform variables below the logistic CDF and N z •,g,t − N z k,g,t uniform variables above the logistic CDF.That is, for all d ∈ {d : τ d = t and γ d = g} we draw and sample a new φ g,t k from the conditional prior (15) truncated to the bounded region max Algorithm 2 GASC: generative model --------PRIOR MODEL --------1: fix hyperparameters κ ψ , a, b 2: draw κ φ ∼ Inv Gamma(a, b) 3: initialise at time t = 1 4: for genre g ∈ 1 : G do 5: draw sense prevalence parameter from improper uniform φ g,1 ∼ π(φ g,1 ) ∝ 1 6: end for 7: for sense k ∈ 1 : K do 8: draw word parameter from improper uniform density ψ k,1 ∼ π(ψ k,1 ) ∝ 1 9: end for 10: for time t ∈ 2 : T do 11: for genre g ∈ 1 : G do 12: draw sense prevalence parameter φ g,t |φ g,t−1 , κ φ ∼ N φ g,t−1 , diag(2κ φ ) 13: end for 14: end for 17: end for 18: using softmax (1), transform real arrays φ and ψ into probability arrays φ and ψ --------OBSERVATION MODEL --------19: fix probabilities of drawing stopwords q SW and uninformative words q U 20: for snippet d ∈ 1 : D do 21: draw number of context words L d |L, q SW , q U ∼ Bin(L, 1 − q SW − q U ) 22: draw a random subset {i 1 , . . ., i Ld } of size L d from {1, . . ., L} The efficiency of this procedure can be improved by sampling two Beta variables instead of N z •,g,t uniform variables.The problem with this method is that the bounded region ( 17) can be very narrow whenever the dimensions N z k,g,t and N z •,g,t are large, leading to a very small move in each iteration.This is because sampling a large number of uniform variables below or above the CDF is likely to result in at least one variable being close to the curve.The resulting convergence is therefore very slow.

B.2 Auxiliary Polya-Gamma variable method
The auxiliary Polya-Gamma variable method of Polson et al. (2012) is used to sample φ g,t k |φ g,t −k , z iteratively over k ∈ 1 : K by first drawing from the Polya-Gamma distribution PG, where η = φ g,t k − log C with C as in ( 18), and then sampling where Σ = (σ −2 + ω) −1 , m = Σ(µσ −2 + N z k,g,t − 1 2 N z •,g,t + ω log C), and µ and σ 2 are the mean and variance for the relevant conditional prior distribution defined in (15) and depending on t.
A draw from a Polya-Gamma distribution is computationally expensive whenever the shape parameter N z •,g,t is large.This is very often the case with any real dataset, especially for ψ updates.Chen et al. (2013) give an alternative approximate method which cuts the computational cost of each draw from O(N z •,g,t ) down to O(1).They note that if x i ∼ PG(1, η) then ω = N i=1 x i ∼ PG(N, η) by the additive property of the Polya-Gamma distribution.Therefore, by the central limit theorem, ω is approximately Gaussian for large N , and may be obtained by transforming another approximately Gaussian random variable λ ∼ PG(M, η) viz.
where E[ω] = N 2η tanh η 2 and Var(ω)/V ar(λ) = N/M .This approximate strategy works well even when M = 1.The method is relatively simple.However, parameter inference is not asymptotically exact and, as we saw in Section 6, the method is in any case dominated by a hybrid MALA-HMC sampler which is asymptotically exact.

Appendix C Gradient-based MCMC methods
We sample the conditional posteriors (10)-( 12) for each variable by proposing an update from a suitable distribution and accepting or rejecting it using the Hastings ratio, iterating over the columns.For example for φ, we propose a candidate vector φ * g,t |φ g,• , ψ •,t , W D(g,t) ∼ q(•|φ g,• , ψ •,t , W D(g,t) ) ( 22) and accept it with probability where p(W D(g,t) |φ g,t , ψ and D(g, t) = {d : γ d ∈ g and τ d ∈ t} is the set of snippet indices for time(s) t and genre(s) g.Updates for θ, χ for DiSC or ψ for GASC are analogous.The conditional prior density π(φ g,t |φ g,−t ) is defined by ( 15), whereas for θ we have the conditional and for χ we have, unconditionally, In practice, it may be more efficient to update all columns of χ together, which is the approach we take in our R implementation for the applications in this paper.
The proposal density q(•|φ g,• , ψ •,t , W D(g,t) ) is constructed using gradient-based methods such as MALA (Roberts and Tweedie, 1996) or HMC (Duane et al., 1987), which make proposals in the direction where the posterior density is increasing.MALA, for instance, uses the proposal distribution for φ updates, and analogous distributions for the other variables, where σ 2 x and Σ x are the proposal scale and covariance parameters for variable x respectively.We keep the covariance Σ x fixed at the appropriate identity matrix, whereas we tune the scale σ 2 x using the log-adaptive proposals of Shaby and Wells (2010).When using HMC with the leapfrog method (see e.g.Neal (2012)), we keep the number of leapfrog steps fixed and use the same technique to tune the leapfrog step size σ 2 x .We experimented with tuning the covariance Σ x , including varying diagonal elements, without gain.
The tuning is done by calculating the running empirical acceptance rate ᾱ = # jumps N based on the last N MCMC iterations, and adjusting the scale parameter up or down via log σ 2 x ← log σ 2 x +C n (ᾱ−α opt ), where C n is a parameter that decreases with iteration number n (so that σ 2 x tends to a constant as n → ∞) and α opt is the target optimal acceptance rate.Under certain conditions, optimal asymptotic acceptance rates have been proposed as 0.574 for MALA (Roberts and Rosenthal, 1998) and 0.651 for HMC (Beskos et al., 2013), which are the values we use for α opt in our implementations.
The gradient of the log posterior density in ( 27) can be broken up into the sum of gradients of the log prior density and the log likelihood.The gradients of the log prior densities ∇ φ g,t log π(φ g,t |φ g,−t ), ∇ θ t log π(θ t |θ −t ) and ∇ χ k log π(χ k ) are of the form −V −1 x (x − µ x ) where the mean vector µ x and covariance matrix V x for variable x are given by ( 15), ( 25) and ( 26) respectively.The gradients of the log likelihoods are derived in the next subsection.
As we find in Section 6 above, these gradient-based methods give significant efficiency gains over methods like the auxiliary uniform or the (asymptotically exact) auxiliary Polya-Gamma samplers described in Appendix B. This is illustrated in Figure 5 in which the x-axis shows the same elapsed time in CPU seconds across all plots and the y-axis shows the MCMC state for the prevalence parameter in one genre and time period.The gradient-based samplers shown on the right give much more rapid mixing than the other two methods.

Appendix D "Bank" additional results
We experimented with thinner time intervals on the "bank" data, as shown in Figure 6.As expected, the uncertainty increases due to less data in each interval, but the Brier scores remain comparable to those for the 20-year intervals (BS = 0.15 for DiSC).The 5-year graph shows an interesting and rather dramatic change in the empirical sense prevalence around 1925, which is smoothed when we take wider intervals.This sharp change could be caused by changes in the makeup of the source texts in the corpus, e.g. if a number of finance-related texts where introduced at this time.However, we did not identify a change of this sort in the corpus.We should be aware that φ (and ψ) measures prevalence in a language sample (i.e. the corpus) and not prevalence in historical language use itself.
In choosing an appropriate interval, two factors come into play.On the one hand, subject specialists have some understanding of the timescale of variation, and time intervals should be no greater than that scale.On the other hand, it is natural to take them as short as possible, subject to having enough data in each interval to usefully inform parameter estimates.There may be some interplay with κ φ , κ ψ and κ θ as these diffusion parameters would naturally be scaled with interval length, so smaller for shorter intervals.When intervals are short, the fit becomes more sensitive to the choice of these hyperparameter priors.Using longer intervals makes the inference more robust to these choices.We encourage user exploration.ψv |W for a single word v in the three examples, together with the true ψv , are shown in Figure 7.The interaction effect is progressively stronger from example 1 to example 3. It can be seen that the DiSC posterior ψk,t v |W is rather flat over time for all senses in example 2, and is thus inaccurate at time 1 for the green sense and at time 9 for the orange sense.Similar inaccuracies can be seen in example 3 for the orange sense at the start and end.
The Brier scores on these examples are shown in Table 9.The performance of DiSC   and SCAN on sense-labelling is comparable in all cases.Even where the interaction effect is very strong in examples 2 and 3, DiSC does not perform much worse than SCAN.Note that having more words with explicit interaction in example 3 reduces the proportion of noisy words, and hence makes it easier to identify the sense.We attempted to construct an example where DiSC fails but SCAN performs well, but were unable to do so.(In any example where DiSC failed, so did SCAN.)Moreover, it proved difficult to get SCAN to converge on these examples due to label-switching between time periods and multi-modality.It appears that not only is sense-time interaction rare in any real-world scenario, but even where it does actually exist the gains from omitting it from the model for sense-labelling purposes far outweigh the small loss in accuracy.

Appendix F "Kosmos" additional results
It may be argued that retaining only the snippets that an expert was able to annotate from context alone (category "collocates", cf.Section 3) biases the data in the sense that the probabilities φ and ψ will in general change.A related concern is that we have made the problem easier than the one we face when applying the method to data without knowing which snippets do not admit a human classification.The "kosmos" dataset contains 1,469 snippets, of which 1,144 are of the type "collocates".The remaining 325 snippets therefore represent 22% of the data, which is a very significant level of noise given the small and sparse dataset.In our experiments, we tried to follow as closely as possible what Perrone et al. (2019) did, and therefore removed this noise in order to ensure a fair comparison with GASC.
Recall that, in choosing the number of senses K we recommend running in a semisupervised mode: are the most frequently associated words under each sense meaningful to the user?We repeat the analysis with non-collocates included and choose K on this criterion.We find that neither DiSC nor GASC identify recognisable meanings on K = 3 senses, and the Brier score is higher than 0.67 (the Brier score under uniform random sense assignment).
When we run DiSC with K = 4 senses on all the "kosmos" data, we identify three of these senses with decoration, order or world based on the most probable context words under each sense.The fourth sense has no recognisable meaning and we think of this as a sense the model is using to capture "noise".We were unable to get the GASC model to converge even with this setting -something we found typical for GASC.
We normalise the sense probabilities p(z d = k) (cf.equation ( 13)) over the three recognised senses and compute the Brier score on type "collocates" data.Thus, in evaluating performance, we exclude snippets a human could not label.This gives a score of 0.38 for K = 4 (slightly better than the 0.41 we advertised for K = 3 senses on the data with non-collocates removed).The equivalent sense prevalence graph from the DiSC K = 4 run is in Figure 8, and shows good agreement with the run conditioned upon the true sense labels.For completeness, if we analyse all the data with K = 4 senses, and compute the Brier score on all the data (i.e.not just on collocates), using the sense labels the human expert assigned to non-collocates from broader contextual considerations outside the words around the target word, we get a score higher than 0.67.Thus, in situations where a human cannot annotate the sense based on context, our model cannot either.

,
z d | φγd,τd ∼ Mult φγd,τd 1 , . . ., φγd,τd K 25: for context position i ∈ {i 1 , . . ., i Ld } do 26: draw context word w d,i |z d , ψzd,τd ∼ Mult ψzd,τd 1 d:τ d =t and γ d =g I(z d = k) is the number of snippets with sense assignment k and genre g at time t, and N W,z v,k,t = d:τ d =t and z d =k iL d i=i1 I(w d,i = v) is the number of occurrences of context word v across all snippets with sense assignment k at time t.The conditional posterior for z is defined, independently for each snippet W d , by

Figure 1 .
Figure 1."Bank" expert-annotated empirical sense prevalence (coloured bars with height N o k,g,t / K l=1 N o l,g,t for each k, g, t) with 95% HPD intervals (error bars) and posterior means (circles) from the model output.Note that there is no perceptible difference between the posteriors φ|z from DiSC and SCAN.

Figure 3 .
Figure 3. "Kosmos" expert-annotated empirical sense prevalence (coloured bars with height N o k,g,t / K l=1 N o l,g,t for each k, g, t) with 95% HPD intervals (error bars) and posterior means (circles) from the DiSC output

Figure 5 .
Figure 5. φg,t k trace plots with equal run times (after burn-in) for the institution sense of "bank" at time 1850-70 from the unlabelled SCAN posterior

Figure 6 .
Figure 6."Bank" expert-annotated empirical sense prevalence (coloured bars with height N o k,g,t / K l=1 N o l,g,t for each k, g, t) with 95% HPD intervals (error bars) and posterior means (circles) from the model output.The first graph omits SCAN φ|W error bars since we could not get the MCMC to converge for SCAN.

Figure 7 .
Figure 7. Examples with explicit sense-time interaction, showing the context probabilities for a typical word v under different senses over time.True probabilities are represented by the coloured bars, 95% HPD intervals by the error bars, and posterior means by the circles.

Figure 8 .
Figure 8. "Kosmos" expert-annotated empirical sense prevalence (coloured bars with height N o k,g,t / 3 l=1 N o l,g,t for each k, g, t) with 95% HPD intervals (error bars) and posterior means (circles) from the K = 4 DiSC run

Table 1 .
Example text snippets for target word "bank" " . . .China.The Yellow River had burst its banks, submerging vast areas of farmland,

Table 2 .
Frequencies of "bank" in each sense-time block across all genres

Table 3 .
Frequencies of "kosmos" in each sense-genre-time block d,i |z d , ψzd,τd ∼ Mult ψzd,τd If we modify this and impose a single topic per document, it becomes precisely the GASC model with G = 1; so comparison of DiSC with the dynamic topic model itself is not appropriate.

Table 4 .
Median (interquartile range) ESS per hour of CPU time

Table 5 .
Brier scores under different genre settings

Table 7 .
Model performance for DiSC and SCAN on synthetic data

Analysis of sense change for "kosmos"
Relationship between φ g,t k , z, and u in the case of 5 snippets.LEFT: Given φ g,t k , ifz d = k then u d (black circle) falls below φg,t k ; and if z d = k then u d (white circle) falls above φg,t k .RIGHT: With u given, φ g,t k can be anywhere within the interval defined by the highest black circle and the lowest white circle.Figure adapted from Mimno et al. (2008).

Table 9 .
Brier scores on synthetic data with explicit sense-time interaction