A data-driven approach to studying changing vocabularies in historical newspaper collections

Nation and nationhood are among the most frequently studied concepts in the field of intellectual history. At the same time, the word ‘nation’ and its historical usage are very vague. The aim in this article was to develop a data-driven method using dependency parsing and neural word embeddings to clarify some of the vagueness in the evolution


Introduction
There has been extensive research on the process in which nation-states become pivotal units for international politics and crucial categories of belonging for individuals (Özkırımlı, 2000; Anderson, 2006; Smith, 2008, 2013). Our aim in this article was to use state-of-the-art word embeddings to describe how this process is reflected in the written use of four different languages: Dutch, English, Swedish, and Finnish. Earlier work has focused on words and concepts relating to nationhood, pointing out the general trajectories of word use and the increased levels of vagueness of 'nation' as a concept (Kemiläinen, 1964; Gschnitzer et al., 1978). Word embeddings are commonly used in natural language processing (NLP), but their application in historical research is still at the experimental stage. There have been many robust attempts at evaluation in state-of-the-art research using embeddings, but in the case of historical data, the evaluation has to be made against historical research. It therefore remains difficult to determine what the real object of the modelling is and whether the results are transferable to other languages, different corpora, or time spans.
Our large-scale comparative perspective demonstrates changes in the development of nationhood with greater clarity than before in focusing on the term 'national' and using words associated with it to analyse domains that were increasingly being conceptualized as national. One of the benefits of our case is that the words nation and national exist in all four languages as cognates or, in the case of Finnish, as neologisms in the 19th century. The historical translatability makes it ideal for comparative study in that it highlights both similarities as well as differences between the languages in question.
Many studies on semantic change of particular words or concepts over long periods of time tend to focus on changes in words that shift between two distinct senses (Recchia et al., 2017; Tahmasebi et al., 2018). The case of national that we chose for this study is different in that it relates to historical processes that are of interest to historians in particular, but it also provides a challenging case for the use of computational methods as it is not about detecting polysemy, but rather about grasping a vague term and its increasing importance in political discourse over time (on vagueness, see Geeraerts, 1993). This also holds for most of the key terms of interest in understanding political, social, and cultural transformation in the modern period. Words such as state, nation, ideology, culture, gender, and racism have been extensively researched as pivotal terms that have been contested in past debates and whose changing meanings have been indicative of historical transformation, but have also been the cause of change in the past (see e.g. Koselleck, 1972-1997; Ball et al., 1989). Although many of these words are polysemous, the aspect that makes them interesting politically and culturally is that they are also vague, at least in one of their senses, that they are used in rather different language domains, and that historical actors seem to have cared a lot about which uses were correct. The vagueness of key terms for navigating society is inherently tied to the complexity of the data required to detect shifts in language use. Historical data have developed in conjunction with societal processes and events (everything from growing wealth to war and censorship practices to changing fashions), and therefore form non-standard data units in terms of computation (Mäkelä et al., 2020). More importantly, developments in the data, in our case, newspapers, are part of the process in which the terminological changes took place. This means that these newspapers cannot be used just to study changes, as the changes in them also need to be factored into the interpretation of the analyses.
The linguistic change relating to 'national' consists of a gradual growth in frequency and an expansion in language domains over time. Setting up a methodology that grasps this development over a long period of time, does so in different languages, is statistically robust, and does so in a data-driven way will pave the way for further historical study that could challenge and complement earlier qualitative accounts of nation-building. We point out that hypotheses developed in earlier studies based on limited source corpora have referred to nations and a shift in focus from the economy to culture and politics (Viroli, 1995; Ihalainen, 2007; Nurmiainen, 2009; Marjanen, 2013). We further propose that a data-driven clustering of the vocabulary relating to national allows for a more fine-grained image of the expansion of the national imaginary. We show signs of change in the language of nationhood that could perhaps be described as processes of culturalization, de-economization, and institutionalization, which should be evaluated more closely in historical research. What this means is that, over the course of the research period, terminology related to culture and political institutions became more commonly labelled as something national (as in national literature or national party), whereas economic terminology became proportionately less dominant in the discourse.
Our method is particularly suited to analysing complex historical keywords that are usually at the heart of studying the history of political and social thought. The development of methods and the concrete plots we devise in this study relate to the vocabulary revolving around the adjective national, but the aim here is not primarily to make a historical argument about nation building. We rather purport to identify ways of using computation to analyse historical trends in past conceptualizations of the world in a more nuanced way than keyword searches, relative frequencies, or topic models have made possible. Ultimately, the methods used to address historically informed questions need some level of tailoring to the data and the type of questions asked. However, given that the bulk of large-scale diachronic text data sets provide possibilities for the study of language in relation to historical processes, there are good possibilities of reuse in other research cases. This goes hand-in-hand with open science and the envisioning of research data as an ecosystem.

Language and nationalism
Nationalism is a widely studied phenomenon and the role of semantic and lexical change has been noted in literature that provides overviews of the topic (see, in particular, Leersen, 2006, p. 15; Burke, 2013; Gilbert, 2018), but the bulk of the literature on nationalism has still been surprisingly indifferent towards the language of nationhood. This disinterest in the long-term changes in language relating to nationhood means that the analytical distinctions relating to nation-states and their emergence have been prioritized at the expense of enhancing understanding of the historical experience of nationhood.
All of the above-mentioned studies, in one way or another, concern long-term trends in the meanings and uses of the words 'nation', 'national', and 'nationalism'. However, apart from in a few isolated cases of resorting to relative frequencies, the use of quantitative methods to trace long-term developments in this vocabulary is almost completely non-existent. The one exception is Van den Bos and Giffard's study, which focuses on key junctures in Dutch history and the language of nation (van den Bos and Giffard, 2016). The present study takes a step forward and a step backward. On the one hand, it engages in earlier claims about changes in word use being part of the process in which past expectations and experiences about nationhood were articulated (Koselleck, 1972, 2011), which on the other hand leads to claims that an over-arching study of the language of national could, in a general way, describe the process through which the national perspective became dominant in how people saw the world (Anderson, 2006). Although interpretations such as these already exist, they all rely on examples of particular texts rather than any kind of data-driven analysis, which means that they may be detailed in terms of individual examples, but they are not even close to capturing the whole story.
Methods for tracing this kind of change are not a perfect match for the historical questions posed in earlier research. It is clear that earlier abstract claims about the shift in focus in the language of nationhood remain too broad to be captured in a meaningful way by methods for tracing semantic change in that they capture many different and partly conflicting signals from the data. Human interpretation has tended to filter them out, and quantitative methods for assessing shifting vocabulary necessarily have to find a good way of balancing a detailed view against a result that is interpretable for human readers. There are good arguments for claiming that modelling may in some cases be less transparent and cannot capture the same things as qualitative interpretation (Biernacki, 2014), but in terms of understanding the evolving language of nationhood, the aspect of modelling and quantification has been completely missing.

Evolving vocabularies
The traditional focus in conceptual history has been on specific keywords such as 'democracy', 'liberalism', and 'nation', but only to a limited degree has there been any systematic analysis of semantic and lexical fields related to these keywords. The concentration on words has led to extensive discussions about their exact relationship with concepts (Steinmetz, 2012; Bolla et al., 2019; Lähteenmäki and Kaukua, 2019; Bolla et al., 2020). Although we do not assume that we can grasp the conceptual level behind words as such, we take a pragmatic approach and use distributional 1 methods to study changing vocabulary. These methods allow us to broaden the scope from words to groups of words (that are in some way related to concepts) through time. We are still intent on using words as proxies and thus remaining on the level of words and language use, because that will enable us to capture at least some of the personal experiences of historical actors. When they sought to express certain concepts, they chose particular words that reflected their own positions and thus left a trace of their experiences in the data. Moreover, focusing on words allows for the relatively easy tallying of their occurrences. The challenge in analysing vocabulary is to 'strike a balance between an adaptive strategy that responds to changes in vocabulary, and a more conservative approach that keeps the vocabulary stable' (Kenter et al., 2015). The vocabulary must maintain a minimal degree of stability in order for it to be historically relevant and meaningful, but at the same time, it should solve the problem of different words relating to the same concept over time (onomasiology).
A data-driven approach to study changing vocabularies Digital Scholarship in the Humanities, Vol. 36, Supplement 2, 2021 ii111
Rather than considering a predefined group of words over time, distributional methods allow for a more data-driven approach. Recent scholarship in history has used word embeddings to identify semantically related words and to follow their development over time. This requires an initial set of seed terms that is subsequently expanded by selecting similar words (Kenter et al., 2015; Recchia et al., 2017). Another approach is to identify a vocabulary based on features of single words such as 'isms' (Pivovarova et al., 2019; Marjanen et al., 2020), or sequences of words (n-grams). The latter approach constructs a vocabulary based on words that are directly preceded (Wevers, 2017; Van Eijnatten and Ros, 2019) or modified (Hill et al., 2018) by a common adjective, and subsequently focuses on the temporal changes. This leads to the quantification of conceptual extension, and gives insights into conceptual and distributional change that would go unnoticed were the focus only on specific keywords. Our method builds on such previous work, and in delegating the choice of 'seed terms' to nouns modified by a specific adjective allows for a more data-driven approach, while at the same time retaining some 'topical control' and harnessing semantic information from word embeddings.

Representing meaning in time
As noted above, previous attempts at studying an evolving discourse diachronically made use of computational methods and large corpora. More recent approaches lean on NLP. In this section, we discuss the state-of-the-art and illustrate why studying a specific theme over time is not trivial.
Topic modelling is extensively discussed and is sometimes used in the humanities (Fridlund and Brauer, 2013; Viola and Verheul, 2019). Although the soft clustering method is most commonly used synchronically for exploratory research, there are also dynamic topic models (DTMs) that take time as a variable and allow the extraction of topics across time slices. DTMs (Blei and Lafferty, 2006) divide the data into discrete time slices and infer topics across them to capture topics evolving over time. A different approach, Topics over Time (Wang and McCallum, 2006), treats time as a continuous variable and the data are not discretized. Although both approaches are promising, their major drawback is that the topic models do not allow for a topic to be defined a priori: they allow an exploratory look at the data, but there is no easy way to ensure that a certain topic will be found.
Another field in which meaning is studied computationally across time is that of lexical semantic change, which is particularly suited for conceptual change in that it focuses on words and not general themes (Kutuzov et al., 2018; Tahmasebi et al., 2018; Tang, 2018). To study meaning change, computational methods proceed in two steps: first, meaning is modelled distributionally in different time bins (subsequent temporal slices of the data at hand); second, the aim is to detect, for any word w, whether the signal between time bins changes in a significant way. In recent years, even laws of semantic change have been proposed (Dubossarsky et al., 2015; Hamilton et al., 2016) and then disproved (Dubossarsky et al., 2017). Some methods have been under rigorous evaluation (Dubossarsky et al., 2019; Schlechtweg et al., 2019, 2020; Shoemark et al., 2019). At the same time, new methods and paradigms aimed at diachronically modelling semantic information are being developed further: DTMs specifically targeting words (Frermann and Lapata, 2016; Perrone et al., 2019) use bag-of-words to draw sense distributions for certain target words over time; dynamic and continuous word embeddings (Bamler and Mandt, 2017; Rosenfeld and Erk, 2018; Rudolph and Blei, 2018; Yao et al., 2018; Dubossarsky et al., 2019; Gillani and Levy, 2019) differ from static embeddings in that they use the entirety of the data (i.e. all time bins) to create vector representations; and, more recently, contextualized word embeddings (which have token vectors and not type vectors 2 ) have been applied to diachronic corpora.
Thus, there have been robust attempts to evaluate and sometimes compare systems, but it remains difficult to determine what is actually being modelled, and whether the performances are transferable to other languages, different corpora, or dissimilar time spans. In short, it is arduous to determine whether NLP systems can be applied as-is to humanities data. Indeed, as McGillivray et al. (2019) remark, for example, despite being promising with regard to English, SCAN (Frermann and Lapata, 2016) performs poorly on an Ancient Greek corpus with sparse data and extended time bins, and the performance of an updated model (Perrone et al., 2019) does benefit from additional information such as literary genre.
As anyone who works with historical material is aware, language changes over time. To avoid anachronisms, one has to make sure that texts are understood in their own context, rather than through a contemporary lens. Although historians have been trained to do this, as the above paragraph shows, current NLP methods might not be completely fit for the task. Additionally, NLP usually focuses on relatively straightforward cases, 3 and it is unclear whether or not the signal picked up by computational models is useful for humanities research, given that the changes in meaning being studied may well not be as obvious. Finally, the computational processing of humanities data notoriously poses specific challenges both by its nature (evolving grammar and orthography, uneven size of data across time, for example) 4 and through how researchers can process it electronically (missing, incomplete, or wrong metadata, varying OCR quality, etc.) (Piotrowski, 2012).

Methodology
As we point out above, studying the changing vocabulary of a concept is no easy task. Doing so in a way that informs research in the humanities in a data-driven way adds a layer of complexity: if a certain, specific theme is to be studied it has to be defined a priori, and operational choices must be made 5 that might bias any quantitative method applied to the resulting subset of the data. Our methodological contribution is an approach that follows the fine line between having a precise research question and making use, in a data-driven way, of all the data available. It is a two-step approach, which we illustrate below in a case study on the changing vocabulary of nationhood in four countries and four languages.
To illustrate that the method is robust enough to tackle different data, languages, and periods, we carry it out on Dutch, Finnish, Swedish, and British newspaper data. The newspapers stem from different sources and countries and are available in different formats. Massive digitized newspaper collections are increasingly used to address historical questions through mining textual data. 6 The material, as well as the pre-processing steps, is laid out below, and the distribution of the data is available in Figs 1 and 2.
The Dutch data come from the Delpher open newspaper archive (Royal Dutch Library, 2017) for the period from 1618 up to and including 1876. This archive is said to contain all newspapers for that period. 7 For the years 1877-1899, which are currently only available through the API, we queried the API for every item of the 'artikel' ('article') type containing the determiner de ('the') at least once; the Dutch data have article segmentation and further differentiate between advertisements and articles. Although this does not guarantee 100% recall, de is so frequent that we are confident that the vast majority of articles of the necessary length for our tasks were retrieved. For anything pre-1877, we discarded pages that had anything other than exclusively 'nl' or 'NL' as language tags in the metadata. Articles from colonial newspapers were systematically removed. This is motivated [...] the Kubhist 2 corpus digitized by the Royal Library of Sweden, processed with the Sparv pipeline (Borin et al., 2016), and made available online 9 by Språkbanken through Korp (Borin et al., 2012). Finally, the British data consist of the British Library Newspapers covering especially the 19th century, 10 the 17th and 18th Century Nichols collection, 11 and the 17th and 18th Century Burney collection. 12 The changes in corpus size over time bins pose a problem for any computational text-mining task. Our approach creates intermediate data points in separate time bins of 20 years, 13 and it is only the aggregate information that is compared over time. As such, common pitfalls related to aspects such as limited vocabularies or the representativity of the data do not necessarily apply, as we spell out in our evaluation.
A further issue is that our data (historical newspapers) are not only data in which we study changing language: the change in corpus size and the growing importance of newspapers as a medium are parts of the historical process in which the language of nationhood has also changed. Growing amounts of newspapers created a different habitat in which the vocabulary of national could flourish; hence, there is no reasonable way of even trying to achieve a balance with the corpus used for the purpose of computation. Rather, understanding changes in the corpus and the development of the public sphere in general is a form of corpus control, which is essential in terms of understanding the changing vocabulary of nationhood (Tolonen et al., 2019).

Extracting nationhood
First, using dependency parsing, 14 we utilized the method proposed by Hill et al. (2018) and extracted all the nouns modified by the adjective at hand, in our case 'national'. 15 With regard to the other languages, we extracted nouns modified by nationaal and nationale in Dutch, nationella, nationell, and national in Swedish, and kansallinen in Finnish. Obviously, different languages have different properties. We resorted to splitting the Finnish and Swedish compound nouns starting with kansallis- and national-, respectively, while making sure they were genuine compounds, removed the 'national' element, and added the remaining part to our tally. As an example, Swedish nationalbiblioteket 'the national library' became nationell + biblioteket, but we discarded nationaliteten 'the nationality' as it is a noun in its own right, 16 and the Finnish kansalliskirjasto 'national library' became kansallinen + kirjasto. Only modified nouns are kept. For newspapers in Finnish and in Swedish from Finland, we used linguistic information made available by the language bank of Finland, 17 and similarly for newspapers in Swedish published in Sweden we used information made available by the language bank of Sweden; 18 both sets of data were produced by different versions of the same pipeline, Sparv (Borin et al., 2016). Dutch and English datasets were dependency-parsed using spaCy 2 (Honnibal and Montani, 2017). The large models were chosen for both languages. Unfortunately, no assessment of the quality of the dependency parsing is available for Finnish and Swedish. 19 The absolute counts of nouns modified by 'national' are displayed in Fig. 3. The relative frequencies show a similar pattern.
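The extraction step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the token dictionaries (with the keys `lemma`, `pos`, `dep`, and `head`) are a hypothetical, parser-agnostic stand-in for the output of a dependency parser, `amod` is assumed to be the label for adjectival modification, and the exclusion list passed to `split_compound` is only one conceivable way of checking that a compound is genuine.

```python
from collections import Counter

# Hypothetical parser output: one dict per token; `head` is the index of the
# syntactic head within the sentence. Real parsers expose richer objects.
SENTENCE = [
    {"lemma": "the", "pos": "DET", "dep": "det", "head": 2},
    {"lemma": "national", "pos": "ADJ", "dep": "amod", "head": 2},
    {"lemma": "debt", "pos": "NOUN", "dep": "nsubj", "head": 3},
    {"lemma": "grow", "pos": "VERB", "dep": "ROOT", "head": 3},
]


def modified_nouns(sentence, adjectives):
    """Return the lemmas of nouns modified by one of the target adjectives."""
    found = []
    for token in sentence:
        if token["dep"] == "amod" and token["lemma"].lower() in adjectives:
            head = sentence[token["head"]]
            if head["pos"] == "NOUN":
                found.append(head["lemma"].lower())
    return found


def split_compound(word, prefix, not_compounds=frozenset()):
    """Strip the 'national' prefix from a compound noun.

    Returns None when the word does not start with the prefix, is listed as
    a noun in its own right, or leaves nothing after the prefix.
    """
    word = word.lower()
    if word in not_compounds or not word.startswith(prefix):
        return None
    return word[len(prefix):] or None


# Tally for an English sentence (adjective + noun).
english_counts = Counter(modified_nouns(SENTENCE, {"national"}))

# Tally for Swedish compounds (illustrative exclusion list).
swedish_counts = Counter()
for noun in ["nationalbiblioteket", "nationaliteten"]:
    remainder = split_compound(noun, "national", {"nationaliteten"})
    if remainder is not None:
        swedish_counts[remainder] += 1
```

In this sketch, compounds and adjective-noun pairs feed the same kind of tally, so the later steps can treat both sources of modified nouns alike.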
Because the meaning of 'national' changes over time, it is possible that other adjectives referred to what we now classify as national. To evaluate the 'centrality' of the adjective in the different time periods, therefore, we aggregated all other adjectives that modified the nouns modified by national. For example, in Dutch, this resulted in adjectives such as 'public', 'Dutch', 'royal', and 'foreign'. The frequencies of these 'competing' adjectives were lower in all decades, however, as well as in the overall time frame. This shows that the adjective 'national' was indeed the most commonly used to modify these nouns. The 'competing' adjectives sometimes perform a supplementary function but, as we will reveal, the discourse of national had a clear role of its own in all languages.
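A minimal sketch of this 'centrality' check, assuming the adjective-noun pairs have already been extracted: the pairs below are made-up toy data, and `competing_adjectives` is a hypothetical helper name rather than part of any published code.

```python
from collections import Counter

# Toy (adjective, noun) pairs; invented for illustration only.
PAIRS = [
    ("national", "debt"), ("national", "debt"), ("public", "debt"),
    ("national", "guard"), ("royal", "guard"), ("foreign", "trade"),
]


def competing_adjectives(pairs, target):
    """Count every adjective that modifies a noun also modified by `target`."""
    target_nouns = {noun for adjective, noun in pairs if adjective == target}
    return Counter(
        adjective for adjective, noun in pairs if noun in target_nouns
    )


tally = competing_adjectives(PAIRS, "national")
most_common_adjective = tally.most_common(1)[0][0]
```

Nouns never modified by the target adjective ('trade' in the toy data) are left out, so the tally only compares adjectives on the shared set of nouns.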

From words to concepts
Second, to allow the semantic clustering of all nouns relating to the concept of 'nation', we trained diachronic word embeddings on the entirety of the full text. Given that there was no conclusive way of determining what type of embedding was best for our data and that word embeddings are still poorly understood, and since we argue that dynamic and continuous word embedding models cannot reliably be used here on account of the extremely uneven distribution of the data, we experimented with two fairly old architectures, CBOW and SGNS (Mikolov et al., 2013a,b), which have been studied more thoroughly. 20 For the same reason, we created diachronic word embeddings using the two most frequently applied methods, post hoc alignment and incremental updating (described in detail below). We chose to train models on double decades for three reasons: first, 20 years roughly corresponds to a 'generation' in historical sociolinguistics (Säily, 2016); second, we needed a certain number of nouns related to the nation for the clustering to make sense, and bins of 20 years allow enough to be gathered, especially in the earlier periods; third, and somewhat echoing the second reason, it allowed us to have relatively stable models for the earlier periods. For each time bin, we trained two types of word embeddings using gensim (Řehůřek and Sojka, 2010), a Python library for vector space modelling. Because separately trained vector spaces cannot be compared directly, we used two different methods to make the spaces comparable, and thus to ensure a sound diachronic approach. On the one hand, we followed Kim et al. (2014) and initialized the vector space for time bin t1 with the space from t0, 21 and updated the vectors by continuing the training. This differs slightly from the original approach in setting the learning-rate value of t1 to that of the end of the previous model (in this case, t0). The aim was to prevent the models from diverging too rapidly, as successfully reported in previous work based on the same data (Pivovarova et al., 2019; Marjanen et al., 2020). These models are referred to later in this article as UPDATE. At the same time, we independently trained word embeddings for all time bins, which we then aligned post hoc as proposed by Kulkarni et al. (2015). The spaces were aligned by means of orthogonal Procrustes analysis, as first done by Hamilton et al. (2016). 22 We refer to these models later in this article as ALIGN. Aside from the frequency threshold, which we raised due to the enormous number of types 23 in our corpora, we used the default (hyper)parameters. 24 We are releasing the models along with this article. 25 Once the word embeddings were trained, we built, for each time bin, a similarity matrix between all the nouns extracted above. In other words, we queried the word-embedding models for a degree of 'semantic similarity' 26 between all words at hand and stored those relations in a table.
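The post hoc alignment and the similarity table can be illustrated with a small NumPy sketch. It makes several assumptions that are not taken from the article: the two matrices are random toy data standing in for embedding spaces, their rows are assumed to cover the same vocabulary in the same order, the later space is rotated onto the earlier one (the direction of alignment is a choice made here), and cosine similarity is assumed as the similarity measure.

```python
import numpy as np


def procrustes_rotation(source, target):
    """Orthogonal matrix R minimising the Frobenius norm of source @ R - target."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T


rng = np.random.default_rng(0)

# Toy space for time bin t0: 50 "words" in 8 dimensions.
earlier = rng.normal(size=(50, 8))

# Toy space for t1: the same geometry expressed in arbitrarily rotated axes,
# mimicking two independently trained models.
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
later = earlier @ hidden_rotation

# Rotate the later space onto the earlier one so that vectors are comparable.
rotation = procrustes_rotation(later, earlier)
later_aligned = later @ rotation

# Similarity table for a handful of (hypothetical) extracted nouns.
similarities = cosine_similarity_matrix(later_aligned[:5])
```

Because the rotation is orthogonal, distances and cosine similarities within a time bin are unchanged by the alignment; only comparisons across bins are affected.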
Semantic clusters can then be created. We used two hard clustering algorithms, which we describe briefly below.
We created the semantic clusters using k-means clustering (MacQueen, 1967) and affinity propagation (Frey and Dueck, 2007). The aim in k-means is to group similar data points together. Its main limitation, in our case, is that the number of clusters needs to be decided a priori. Our second clustering algorithm, affinity propagation, has the advantage of finding the number of clusters automatically: it splits the data into exemplars and instances, exemplars being representative tokens of their instances, the non-exemplar tokens in the same cluster. As Pivovarova et al. (2019) point out, 'Affinity Propagation has been previously used for several NLP tasks, including collocation clustering into semantically related classes (Kutuzov et al., 2017) and unsupervised word sense induction (Alagić et al., 2018)'. Given that, just as in the above-cited article, we lacked a gold standard, we used standard hyperparameters 27 as available in the scikit-learn package (Pedregosa et al., 2011). The main weakness of affinity propagation remains the computational and memory costs: its O(n^2) 28 cost is limiting in larger datasets.
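The limitation of k-means mentioned above can be made concrete with a small, dependency-free sketch of Lloyd's algorithm. It is not the scikit-learn implementation used in the study: the two-dimensional vectors are invented toy values, and taking the first k points as initial centroids is a simplification chosen here to keep the example deterministic. Affinity propagation is not sketched.

```python
import math

# Invented 2-d "embeddings" for six nouns; real vectors have far more dimensions.
VECTORS = {
    "army": (0.90, 0.10),
    "debt": (0.10, 0.90),
    "navy": (0.80, 0.20),
    "bank": (0.20, 0.80),
    "militia": (0.95, 0.05),
    "revenue": (0.15, 0.85),
}


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points are the initial centroids."""
    centroids = [list(point) for point in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


points = list(VECTORS.values())
labels_two = kmeans(points, k=2)
labels_three = kmeans(points, k=3)
clusters_two = {
    label: [w for w, l in zip(VECTORS, labels_two) if l == label]
    for label in set(labels_two)
}
```

With k=2 the toy data fall into a military and an economic group; with k=3 the algorithm still returns three clusters, splitting one group, which is why the choice of k has to be justified separately.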
As can be inferred from the previous subsections, the main strength of our approach is that it allows researchers to rely on hypotheses stemming from historical research while being data-driven. To a certain extent, we used the entirety of the data available (for English, upwards of 50 billion words) while guiding the process: the only interference, which we admit is crucial and requires domain expertise, was choosing a key adjective on which to focus. The final product, fine-grained on a one-year basis, is refined enough to be analysed in broad strokes as well as to lead to deeper dives into specific periods. Through the use of the entirety of the data, and time-specific meaning representations of words, the method avoids the common trap of teleology. Unfortunately, an inherent weakness of type embeddings is that polysemous words have a single representation in vector space, 'ironing out' the polysemy. This is problematic in that some words might have a certain meaning in the context of the topic at hand that is not the main sense of the word, leading to bad clusters. 29

Evaluation
The method proposed in this article can be evaluated from two perspectives. First, intrinsically, we show that choices made in the preparation of the data and in the creation of the intermediate, aggregate data are reliable and produce sound output. Second, through a case study in four languages, we show that the method produces results that are useful for downstream tasks and analyses such as the study of a concept across large time scales.
Word embedding models are commonly evaluated using, for example, word analogies or word similarities. It should be pointed out that these evaluations are carried out on present-day data for which ground truth exists. To take but one example, Pennington et al. (2014) used the analogy task in Mikolov et al. (2013a) as well as the word similarities available in WordSim-353 (Finkelstein et al., 2002). However, ground truth is not available for our data. Even if we were to find enough annotators 30 to create ground truth and evaluate our embeddings, doing so would entail creating an unreasonable amount of it, 31 given that we are training different models on different time bins. 32 This is well beyond the scope of this project. Finally, as Chiu et al. (2016) point out, there is no guarantee that intrinsic evaluations of word embeddings such as described above indicate better performance in downstream tasks. Instead, we rely on recent conclusions reported by Hill and Hengchen (2019), who point out that historical, OCRed, relatively dirty data (i.e. texts with an F-score of approximately 0.75 compared with their corresponding keyed-in ground truth) do not severely impact the performance of vector space models.
Following this, we performed a manual evaluation on certain words to make sure that the models output semantically similar words. The models for all languages except Swedish output words that were deemed correct. 33 As a result, we retrained the Swedish word embeddings after performing some data alteration: we only kept sentences that were at least ten tokens long and for which the Sparv processing pipeline could find at least 50% of lemmas.
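The sentence filter applied to the Swedish data can be sketched as below. The thresholds come from the text; everything else is an assumption: `keep_sentence` is a hypothetical helper, and representing a missing lemma as `None` is an encoding chosen for illustration, not necessarily the one produced by Sparv.

```python
def keep_sentence(tokens, lemmas, min_tokens=10, min_lemma_share=0.5):
    """Keep sentences that are long enough and for which enough lemmas exist.

    `lemmas` holds one entry per token, with None where no lemma was found
    (an assumed encoding).
    """
    if len(tokens) < min_tokens:
        return False
    found = sum(1 for lemma in lemmas if lemma is not None)
    return found / len(tokens) >= min_lemma_share


# Ten tokens, five of which received a lemma: kept (exactly 50%).
tokens = ["w%d" % i for i in range(10)]
half_lemmatised = ["lemma"] * 5 + [None] * 5
kept = keep_sentence(tokens, half_lemmatised)
```

Whether the 50% threshold is inclusive is not stated in the text; the sketch treats it as inclusive.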
Our manual checking confirmed that the word similarities for all languages and models post-1700 (where available) seemed meaningful, and that OCR errors were indeed captured and deemed similar. 34 Similarly, all clusters, whether produced with k-means or affinity propagation, were meaningful, as illustrated in the example in Fig. 4. Plotting clusters and their evolution across time (clusters are given weight through a frequency count of their members) showed the expected signal. For example, Fig. 4 35 shows the 1860-1880 situation in Finnish-language Finnish newspapers. The 1863 peak for the legislative cluster (hallitus, hallituksen, puolueen, hallitns, sota) 36 stems from content about the 1863-1864 session of the Diet of Finland, 37 the legislative assembly of the Grand Duchy of Finland. The red cluster exploding in 1871 (kokous-kokouksen-kokouksessa-kokoukselle-kokoukseen) 38 stems largely from texts relating to the Franco-Prussian War.
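The weighting of clusters by the frequency of their members can be sketched as follows. The cluster assignments, noun labels, and yearly counts are invented placeholders, not corpus figures, and `cluster_weights` is a hypothetical helper name.

```python
from collections import Counter

# Placeholder cluster membership and yearly noun counts (not corpus data).
CLUSTER_OF = {"noun_a": "cluster_1", "noun_b": "cluster_1", "noun_c": "cluster_2"}
YEARLY_COUNTS = {
    1863: {"noun_a": 40, "noun_b": 10, "noun_c": 5},
    1871: {"noun_a": 12, "noun_c": 60, "noun_d": 7},
}


def cluster_weights(yearly_counts, cluster_of):
    """Sum, per year, the frequencies of the nouns belonging to each cluster."""
    weights = {}
    for year, counts in yearly_counts.items():
        totals = Counter()
        for noun, count in counts.items():
            # Nouns without a cluster assignment are ignored.
            if noun in cluster_of:
                totals[cluster_of[noun]] += count
        weights[year] = dict(totals)
    return weights


weights = cluster_weights(YEARLY_COUNTS, CLUSTER_OF)
```

Plotting such yearly totals per cluster would give curves of the kind described above, where a peak reflects a surge in the use of a cluster's member nouns.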
As we are proposing a method for which there is no gold standard and to which the notion of 'absolute truth' cannot be applied, the only way to determine whether the approach serves a purpose is to establish its usefulness. 39 The second, extrinsic, part of our evaluation is described in the next section.

Findings
Harnessing word embeddings to cluster words is a powerful and useful tool when matched with the right kind of research questions. In the case of the expanding discourse of 'national', for example, our clustering demonstrates the expansion of the vocabulary of nationhood. This does not as such challenge existing historiography, but clusters based on affinity propagation indicate this change in all four languages in such a way that the clusters make sense to a reader with historical knowledge of the period. In English, for instance, we show (in Fig. 5) how affinity propagation produces only one cluster for the time bins from the 18th century, indicating that the language of 'national' was tied to issues related to the military and the economy (debt in particular). Earlier research focusing on the history of economic thought has pointed this out (Hont, 2005), but perhaps because of the focus on the economy, the point has not been widely accepted in the literature. Our analysis of the totality of the material does point to a dominance of economic and military discourse in the period when conceptualizing things as national started to become more common. As expected, we also show that the era of the French Revolution heralded a period of gestation in which national themes were associated with political and, to a certain extent, sentiment-related themes. This entailed a clear expansion of the conceptualization of what could be perceived as national. This process continued and, consequently, affinity propagation provides many more distinct clusters for the early 19th century.
The clustering for Swedish, Finnish, and Dutch follows a similar pattern, but there are some differences in the timing and contents. In Finnish, for instance, the word kansallinen (as a translation of 'national') did not become frequent before the 1850s (depending on the threshold), so the development is naturally different, with a much quicker expansion of the vocabulary as established notions of nationhood were readily translated into Finnish from Swedish, German, and English. This suggests that the findings resonate with historical knowledge of the period and could therefore be used to further explore national peculiarities with regard to the vocabulary of nationhood.
One way of looking at the cluster differences in the case of 'national' is to pay attention to the nature of the clusters and not only to the linked individual words. When focusing on the 19th century, such an approach results in a clear division between sentiment-based (feeling, spirit, pride, prejudice) and object-based (bank, schools, council, government) nouns related to nationhood. When studying nationalism, this is a crucial division, as it directs our attention to the growth of discourses relating to identity and affinity on the one hand and state institutions on the other. As such, it channels attention to the hypotheses about culturalization and institutionalization mentioned in Section 1. The aim in this article was not to make a full-fledged historical argument. It is enough to observe that, in the study of nationalism, the clusters produced through affinity propagation are perhaps more fine-grained than the themes a historian reading texts would consider relevant. At the same time, these more detailed clusters could be thematically grouped, and they would seem to capture a greater range of (sometimes conflicting) signals in the data than a human reader could. We may now begin to examine in a data-driven way how different types of attitudes to nationhood emerge over time and in different places and languages.
Our distributional methods based on affinity propagation performed well in tracing general development with regard to nationhood, but also point towards more detailed findings that could be evaluated from the perspective of historical change. As such, we go considerably further than keyword searches, simple plots of relative frequency, or even topic models in providing methods for studying diachronic change that relate to theories of long-term historical change. It should also be possible to use the method in analyses of other themes, such as the changing vocabularies of secularization, modernization, and the process of civilization.

Conclusion
The aim in this article was to develop a data-driven method using word embeddings to examine how nation-states became central units for international politics in 19th-century Europe. The study relied on large digitized newspaper datasets in four different languages. To our knowledge, such a large-scale comparative study that grasps long-term development in as many as four languages and is statistically robust has not been attempted before. A major strength of the method is that, by design, it is not limited to the study of nationhood but extends beyond it to different research questions and is thus reusable in varying contexts. Word embeddings, which are at the core of the method in this article, have recently gained popularity in NLP, but their successful use in historical studies is not so evident. Although there have been robust attempts to evaluate and sometimes compare NLP methods, it remains difficult to determine what is actually modelled in different cases, and whether the performances are transferable to other languages, different corpora, or dissimilar time spans. To semantically cluster all nouns relating to the word 'national', we trained diachronic word embeddings on the entirety of the full-text historical newspaper corpora at our disposal in Dutch, Swedish, Finnish, and English. We used both k-means and affinity propagation clustering, of which the latter seems to provide results that are more intuitive to a domain expert. Given that there is no safe way of determining what type of embedding best suits our purpose, and that no dynamic and continuous word-embedding models could be reliably used due to the extremely uneven distribution of the data, we experimented with two relatively old architectures (CBOW and SGNS). This turned out to be a good, pragmatic choice: our manual evaluation showed that the models output semantically similar words and that the clustering lends itself to historical interpretation.
As evaluation in the sense of using a gold standard is not possible, the way to evaluate the method further is to conduct more case studies that would allow deeper interpretations of changing vocabularies related to historical processes.

Notes
A manual evaluation of the results for all time bins indicates relatively high precision (i.e. output is made up of nouns), but we have no information regarding recall. For more on dependency parsing on (historical, OCRed) newspapers, see van Strien et al. (2020).
20 See for example: Antoniak and Mimno (2018) and Mimno and Thompson (2017).
21 And thus $t_2$ with $t_1$, and $t_3$ with $t_2$, etc.
22 We used the code provided by Ryan Heuser, whom we thank. A copy is available at: https://gist.github.com/faustusdotbe/5a87007aaccc1342608c049af83fc5d2. As the code effectively deletes vectors that are not in all time bins, we made sure our nation-related nouns were not deleted.
November 2020).
26 For words $w_1$ and $w_2$, the similarity score is $(\mathbf{w}_1 \cdot \mathbf{w}_2) / (\lVert \mathbf{w}_1 \rVert \, \lVert \mathbf{w}_2 \rVert)$, where $\mathbf{w}_1$ and $\mathbf{w}_2$ are, respectively, the L2-normalised vectors for $w_1$ and $w_2$ and $\lVert \cdot \rVert$ denotes the Euclidean norm.
27 Although the number of clusters cannot be set, the preference hyperparameter (which defines the 'will' of an item to be an exemplar) can be tuned.
28 The 'Big O notation' is used to describe how calculation time or space requirements grow as the input of a certain algorithm grows. In the case of $O(n^2)$, this means that the growth is quadratic: with an input of 10 the requirement is $10^2 = 100$, with an input of 100 the requirement is $100^2 = 10{,}000$, etc.
29 We did not find that to be the case in our case study, and do not expect it to be a large problem.
30 In general, between three and five annotators are needed to be able to calculate satisfactory inter-annotator agreement. See Schlechtweg et al. ( , 2020) and Schlechtweg and im Walde (2018) for a discussion on creating annotated ground truth in diachronic corpora, and particularly on how domain experts (in that case, historical linguists) have better agreement scores, reinforcing our intuition that domain experts are needed.
31 Additionally, we echo similar work in stating that diachronic word embedding models fine-tuned on one task are not necessarily perfect for other tasks.
32 Even if the OCR tool used is the same across all time bins, there is still variation in the input data (fonts, columns, etc.), forcing us to be thorough, and to evaluate all models.
33 A more thorough analysis should be conducted, but it seemed that the model could not abstract from the structure of definites and indefinites: for example, brev 'letter' and brevet 'the letter' were deemed less similar than brevet and gardet 'the guard', two tokens that share nothing apart from their suffix. Note: this observation was made on the first version of the corpus (Kubhist); the final embeddings used and released were trained on the larger, more recent version (Kubhist 2).
34 The reason we do not delve into the difference between clusters created with ALIGN and UPDATE models is that we could not find a meaningful difference.
35 The clusters were created using k-means with eight clusters. Different cluster sizes for the same time bin show the same behaviour.
36 Literally: 'government' in the nominative, 'government' in the genitive, 'political party' in the genitive, 'government' in the nominative with an OCR error, 'war' in the nominative.
37 Note that even if the original noun (waltiopäivät, valtiopäivät, 'state diet') is not present as a keyword because it was not modified by 'national', the event is still present.
38 Literally 'meeting/assembly' in different cases.
39 For a more extensive discussion on the notion of 'fitness for use', see Boydens (1999).