Benjamin Gittel, Thomas Haider, The ongoing birth of the narrator: empirical evidence for the emergence of the author–narrator distinction in literary criticism, Digital Scholarship in the Humanities, 2025, fqaf013, https://doi.org/10.1093/llc/fqaf013
Abstract
This article explores the historical evolution of the distinction between author and narrator in German-language literary criticism, an area largely unexplored by quantitative methods. While narratologists often distinguish between a fictional narrator and the author, the practical adoption of this distinction by readers remains under-examined. We hypothesize a semantic shift in the term ‘narrator’ from referring to the actual author to referring to a fictive entity imagined by readers, indicative of modern fiction practices. Our methodology combines manual annotation with computational analysis of historical periodicals (1841–2018) to track this semantic change. We manually annotated instances of ‘narrator’ (German ‘Erzähler’), differentiating four word senses: oral narrator, author of a narrative, fictive heterodiegetic narrator, and fictive homodiegetic narrator. We trained different BERT models to recognize and visualize these word senses. Finally, we employed cross-validated models in a diachronic large-scale analysis, finding that the term ‘narrator’ gradually changed its meaning from denoting the actual author of a narrative to denoting a fictive entity that the reader of fiction has to imagine. Two observations are surprising: first, this change is still ongoing; second, it is driven mainly by the increase of the homodiegetic-narrator word sense rather than by the word sense to which narratologists attach particular importance, the ‘fictive heterodiegetic narrator’, which even after the year 2000 remains much less frequent than the other word senses.
1. Introduction
Narratologists today typically assume that all fictional narratives have a fictive narrator that is to be distinguished from its author (Martínez and Scheffel 1999: 71f.; Schmid 2008: 77–81, esp. 81; Lahn and Meister 2016: 73–74),1 although this claim has been contested more recently on theoretical grounds (see Walsh 2007: ch. 4; Kania 2005; Currie 2010; Köppe and Stühring 2011; Patron 2016[2009]; Zipfel 2015; and the contributions in Patron 2021). What this debate seldom mentions is the fact that the distinction between author and narrator describes a way of dealing with literary texts whose ultimate empirical correlate is a behavioral pattern of readers (see Fludernik 2003: 333; Jajdelska 2007: 4f.). If readers were not interested in the distinction—and for centuries they arguably did not care much—narratologists could still make the distinction, but it would arguably not have the importance that is attached to it today. Existing research on the history of the author–narrator distinction deals mainly with the discourse about the distinction in poetics and literary theory by identifying different periods in different language-bound traditions in which the distinction has been debated (see Patron 2016[2009], 2021). However, the history of the distinction in readers’ practice has been researched much less: how and when the distinction between author and narrator was established in reading practice(s) is largely obscure, although explorative studies indicate that it was by no means fully established during the 19th century; that is, readers did not regularly use the distinction when reading a fictional text, as is evident from reception testimonies (see Pieper 2015; Dawson 2016; Gittel 2021).
To our knowledge, the role of the distinction in historical literary reading practices has not been studied systematically by quantitative methods. This is not very surprising, because it is unclear by which methods one may investigate whether and, if so, when historical readers adopted a narrative instance distinct from the author. It is even more uncertain how this could be investigated using quantitative methods, which are generally an important component of the investigation of social practices, insofar as such practices are inter-individual behavioral regularities structured by social rules in the broadest sense (Rawls 1955: 3, fn. 1; Tuomela 2002: 94–6).
Against this background, the present article introduces an annotation-based and computational approach to study the history of the author–narrator distinction in literary criticism on a broad empirical basis using large language models. Specifically, we analyze the semantic change of the term ‘narrator’ in literary criticism under the hypothesis H that the term gradually shifted its meaning2 from the actual author of a narrative to a fictive entity that the reader of fiction has to imagine according to literary conventions. This hypothesis has two parts, which can be assessed independently:
(H1) The word sense ‘fictive narrator’, who is not part of the narrated world (called ‘heterodiegetic narrator’ in narratology), becomes more frequent over the period of analysis.
(H2) The word sense ‘author of a narrative’ becomes less frequent over the period of analysis.
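Operationally, both hypotheses amount to tracking the relative frequency of each annotated word sense per time period. The following minimal sketch illustrates this computation on hypothetical (year, sense) annotation records; the record format and sense labels are illustrative stand-ins, not the study's actual data format:

```python
from collections import Counter, defaultdict

def sense_frequencies(records):
    """Relative frequency of each 'Erzähler' word sense per decade.

    records: iterable of (year, sense) pairs, where sense is one of
    'oral', 'author', 'heterodiegetic', 'homodiegetic'.
    """
    by_decade = defaultdict(Counter)
    for year, sense in records:
        by_decade[(year // 10) * 10][sense] += 1
    # Normalize each decade's counts to relative frequencies.
    return {
        decade: {s: n / sum(counts.values()) for s, n in counts.items()}
        for decade, counts in by_decade.items()
    }

# Invented toy data: the 'author' sense dominates early, fictive senses later.
records = [
    (1855, "author"), (1857, "author"), (1859, "oral"),
    (1992, "homodiegetic"), (1995, "heterodiegetic"), (1998, "homodiegetic"),
]
freqs = sense_frequencies(records)
```

Under H1, the heterodiegetic share would rise across decades; under H2, the author share would fall. Both trends can be assessed independently from the same frequency table.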
By analyzing the semantic change of the term ‘narrator’, we gather quantitative empirical evidence for the process by which the (modern) ‘practice of fiction’, consisting of reading and production practices (Lamarque and Olsen 1994: esp. 32–4; Gittel 2021), gradually establishes itself in literary criticism. By ‘literary criticism’ we mean the literary institution that is dedicated to ‘the informative, interpretive and evaluative discussion of primarily newly published literature and contemporary authors in the mass media’ (Anz 2004: 194). Literary criticism in this sense is different from literary studies, although, nowadays, many literary critics who write reviews and related text types are literary scholars or have been educated in literary studies (Hohendahl 1985; Magerski 2004; Shattock 2013).
The present article illustrates that the research question of how and when the distinction between author and narrator has been established in reading practice can be addressed successfully by taking a quantitative semantic approach. To explain this, some elaboration is needed. Reading practices generally consist of certain rules or conventions that can only be observed indirectly through inter-individual behavioral regularities, sanctioning behavior, or certain reflections of the actors on these conventions. An important, repeatedly postulated rule of modern fiction reading practice can be formulated roughly as follows: ‘Imagine, on the basis of fiction-internal sentences, that a narrator (not the author) tells us that p is the case in the fictional world’ (Alward 2007; Bareis 2008: ch. 3.3). This postulated rule has a descriptive component (it is intended to describe what readers generally do when they act within the framework of modern fiction practice) and a normative component (readers can be criticized if they do not follow it, for example at school). Obviously, the concept of the narrator figures prominently in the postulated rule; moreover, it is not at all clear how an equivalent rule could be formulated without the concept of the narrator as distinguished from the author. Therefore, we believe that a semantic differentiation of the concept of the narrator, according to which it can refer not only to real-existing but also to fictive entities, is a semantic prerequisite for the establishment of the aforementioned rule of fiction reading practice.3 Once the concept of a fictive narrator becomes established, there is a (concise) answer to the question ‘who else speaks in the narrative if not the author?’, which in turn makes it possible to popularize and possibly institutionalize the author–narrator distinction.
(How exactly theoretical considerations on this distinction in literary theory or narratology on the one hand and reading practice on the other relate causally is an exciting question beyond the scope of this paper, although some of our findings allow us to generate initial hypotheses regarding certain time periods. In principle, it is conceivable that the theoretical distinction acts as a driving force for the enforcement of the rule or that the theoretical considerations reflect certain tendencies in reading behavior, especially of expert readers.)
In line with the argument given, manual annotation of word senses plays a central role in this study, serving as the foundation for understanding and categorizing fictive and non-fictive word senses of instances of the ‘narrator’ term, following a semasiological approach (Geeraerts 2009: 26ff.; Glynn 2015). Evaluating the manual annotation, in turn, is a condition for generalization: by training classifiers to discriminate ‘narrator’ senses and evaluating how well computational models encode the fictivity dimension of the German term ‘Erzähler’ (narrator), we lay the ground for an automated large-scale analysis with large language models. Contextualized embeddings (derived from large language models) are well-suited for tracking semantic change over time (Wiedemann et al. 2019; Rother, Haider, and Eger 2020; Schlechtweg et al. 2020). However, the dependence of these models on specific domains and text types, and the complexity of the language of literary criticism, present challenges, highlighting the need to assess their capacity to encode relevant dimensions of the German-language literary criticism domain we are interested in.
The remainder of the article is structured as follows: In Section 2, we outline work from literary studies and computational linguistics related to the research question. Section 3 describes two appropriate corpora for studying long-term semantic change in German-language literary criticism. Section 4 presents our collaborative manual annotation, including gold-standard evaluation and diachronic analysis. In Section 5, we examine the ability of BERT language models to distinguish between ‘narrator’ word senses, including an error analysis and a visualization of the resulting embeddings, in order to predict the categories at scale with performant models. The large-scale analysis is carried out with the main objective of tracking semantic change over time and the secondary objective of comparing different reader groups (scholars vs. non-scholars). The last section summarizes the work and outlines future work.
2. Related work
2.1 Related work in literary studies and literary theory
While the history of the author–narrator distinction is not an established research field, researchers have dealt with it in three research areas: (1) Diachronic Narratology, (2) History of Fictionality, and (3) History of Literary Studies and Poetics. This section is a brief survey of these research areas.
Most standard works in narratology assume that all fictional texts have a fictive, text-internal narrator (see, e.g. Martínez and Scheffel 1999: 71f.; Schmid 2008: 77–81, esp. 81; Lahn and Meister 2016: 73–4). Recently, this view has been contested on theoretical grounds, with so-called optional-narrator theories evolving (i.e. Kania 2005; Walsh 2007: ch. 4; Currie 2010; Köppe and Stühring 2011; Patron 2016[2009]; and the contributions in Patron 2021). According to these theories, one should only assume a fictive narrator if the fictional text lends evidence for this assumption (e.g., if the narrator has properties that differ from those of the actual author); otherwise the author should be regarded as telling the story. What this theoretical debate usually omits is that the author–narrator distinction is historically bounded insofar as it can be used to describe practices of dealing with literary texts (see Fludernik 2003: 333; Jajdelska 2007: 4f.). The tendency to lose sight of this insight becomes apparent when researchers tell the history of the author–narrator distinction as a history of progress toward modern narratology (see Orgis 2019: 103 or Lahn and Meister 2013: 23, who claim that the distinction had to be ‘discovered’ by early narratologists like Käte Friedemann at the beginning of the 20th century). On the other hand, there are explorative analyses indicating that the distinction was by no means fully established during the 19th century (see Pieper 2015; Dawson 2016; Gittel 2021).
The distinction between author and narrator is often regarded as crucial for today’s critical practice of approaching fictional narratives—and has even been proposed as a definitional criterion of fiction (Genette 1993[1991]: 70), a position neglected inter alia by Gertken and Köppe (2009). However, with the gradual assertion of an institutional theory of fiction (e.g. Lamarque and Olsen 1994; Köppe 2014; Zipfel 2016), the history of fictionality has also started to change: if fictionality is increasingly understood as a social practice, which is essentially determined by sets of rules for authors and readers and their shared knowledge of these rules, the objective of a history of fictionality becomes the description of these rules and their changes over time (see the contributions in Gittel 2020). One of the central rules of today’s practice of fiction can be understood as the ‘rule of imagination’, which requires the reader to imagine a fictional world. However, this rule can be understood in two ways: either the reader is to imagine that something is the case in the fictional world (imagining that p), or the reader is to imagine that a narrator reports that something is the case in the fictional world (imagining that N tells that p; Walton 1990: 365–8; Banfield 2015[1982]: 183–224). A more nuanced position holds that the decision which ‘reception frame’ to adopt can have certain functions within the interpretation of concrete texts (Korthals Altes 2014: 146).
In histories of German-language literary studies and histories of narrative theory, the distinction between author and narrator is most often associated with the work of Oskar Walzel’s student Käte Friedemann, ‘The Role of the Narrator in Epic’ (Friedemann 1965[1910]: esp. 26f.; see Frey 1948: 275; Schmid 2008: 11–12; Lahn and Meister 2013: 22–23). Almost simultaneously, Margarete Susman (1910: 16–19) postulated a categorical separation between author and speaker (germ. lyrisches Ich) for lyric poetry. Sometimes, it is assumed that the distinction between author and narrator did not become widespread in German studies until the work of Wolfgang Kayser and Franz Stanzel in the 1950s and 1960s (Pieper 2019: ch. 2.4, esp. 266; on the French and English-speaking world, see Patron 2021). Recent research, however, suggests that the history of the establishment of the distinction in German studies under rubrics such as ‘novel technique’ (Romantechnik) and ‘narrative technique’ (Erzähltechnik) has a pre-history in poetological texts, especially of the 19th century (Grüne 2014, 2016). Since corresponding research contributions usually rely on a few selected testimonies (e.g. Patron 2021: 110–1), it seems promising in general to employ a quantitative large-scale approach as in the present paper.
2.2 Related work in computational linguistics
Studies about semantic language change play a crucial role in the broader context of understanding the evolution of language over time (Kutuzov et al. 2018; Tahmasebi, Borin, and Jatowt 2021) and have also provided ample evidence for social change reflected in language (Hamilton, Leskovec, and Jurafsky 2016a, 2016b; Mendelsohn, Tsvetkov, and Jurafsky 2020). And while (computational) historical linguistics has traditionally investigated sound change (Borin and Saxena 2013; List, Greenhill, and Gray 2017), research concerning the change in meaning at the word level has evolved into a distinct area in computational linguistics, commonly referred to as Lexical Semantic Change. Typically, this research is driven by its utility for historical semantics and historical lexicography (Tahmasebi et al. 2022) and is commonly examined either from an onomasiological standpoint, which investigates the words used to express a particular meaning (Geeraerts et al. 2024), or from a semasiological perspective, which explores the various senses a word can express over time (Geeraerts 2009: 26ff.; Glynn 2015; Baldissin, Schlechtweg, and Schulte im Walde 2022).
Computational approaches offer the benefit of generating predictions about semantic change for substantial data sets, allowing us to examine a greater amount of data, thereby reducing the need for extensive human involvement and mitigating sampling bias (Schlechtweg 2023). Methods from distributional semantics in particular have shown their utility. These approaches can be roughly divided into so-called type models on the one hand, where embedding matrices are computed for separate time steps and the vector spaces are then aligned (Eger and Mehler 2016; Dubossarsky et al. 2019; Haider and Eger 2019), and so-called token models on the other hand, where the focus is not on word types and their aggregated representations but on separate representations for each token, based on its sentential context. Among the latter, contextualized embeddings (derived from large language models) are a particularly good choice for our research question, as they have been shown to capture meaning variation through encoding context (Laicher et al. 2021; Kutuzov, Velldal, and Øvrelid 2022). The resulting vector representations can be used to reliably track semantic change over time (Wiedemann et al. 2019; Rother, Haider, and Eger 2020; Schlechtweg et al. 2020). However, previous research has also noted that the language models used to extract meaning representations are fairly domain-dependent (Augenstein, Derczynski, and Bontcheva 2017; Field and Tsvetkov 2019; Xu et al. 2020) and that the language of literary (and literary-adjacent) writing presents a particular challenge (Bamman, Popat, and Shen 2019; Sims, Park, and Bamman 2019; Brunner et al. 2020). Thus, it is paramount to test the ability of models to encode domain-specific dimensions of interest.
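The token-model idea can be sketched as follows: each occurrence of the target word is mapped to a contextual vector, and the occurrence is assigned to the sense whose centroid (computed from manually labeled occurrences) is nearest in cosine similarity. The three-dimensional vectors below are toy stand-ins; in practice they would be high-dimensional embeddings from a model such as BERT, and this nearest-centroid rule is an illustrative simplification, not the classifier evaluated later in the article:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify(token_vec, labeled):
    """Assign a token embedding to the sense with the nearest centroid.

    labeled: dict mapping sense label -> list of embedding vectors
    for manually annotated occurrences of that sense.
    """
    centroids = {sense: centroid(vecs) for sense, vecs in labeled.items()}
    return max(centroids, key=lambda s: cosine(token_vec, centroids[s]))

# Toy 3-d "embeddings"; real contextualized embeddings would be e.g. 768-d.
labeled = {
    "author": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "homodiegetic": [[0.0, 1.0, 0.2], [0.1, 0.9, 0.3]],
}
predicted = classify([0.95, 0.15, 0.05], labeled)
```

Because each token gets its own representation, sense proportions can be aggregated per time slice without aligning separate vector spaces, which is the practical advantage of token models over type models for this task.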
3. Corpora
In order to study long-term semantic change in literary criticism, we used two German-language corpora, each covering a relatively long period of time: the journal Die Grenzboten (1841–1922) and the weekly newspaper Die Zeit (1946–2018) (Geyken et al. 2017; Barbaresi 2021). Die Grenzboten (‘The Border Messengers’, Nölte et al. 2016) was a German weekly (partly bi-weekly) political and literary journal founded in 1841 by Ignaz Kuranda. The journal was published in Leipzig and became one of the most important periodicals in Germany in the 19th century. Die Zeit is a national German weekly newspaper that first appeared on February 21, 1946 and still exists today. It is regarded as one of the most influential highbrow newspapers in Germany, with an extensive feuilleton. In terms of size and temporal distribution, the two corpora are fairly different. The Grenzboten corpus contains a total of 64 million tokens over 2.8 million sentences, while the Zeit corpus contains 563 million tokens spread over 25.8 million sentences (thus being orders of magnitude larger).
Both corpora contain, besides reviews, all sorts of other text types (from short stories to political essays). The Zeit corpus includes metadata in the form of content sections (Ressorts). We focused on the newspaper section ‘culture’ (Kultur). The Grenzboten corpus does not come with proper text type annotation. However, the target concept we are interested in (‘Erzähler’ [narrator] and its morphological variants, see Section 4.2) predominantly occurs in contexts concerned with literary criticism in the broadest sense. German-language literary studies dealing with contemporary texts (in contrast to old/medieval ones) only became established in the last third of the 19th century, a process that continued until at least 1910 (Weimar 2003: 432–44), so the vast majority of people publishing in the Grenzboten were not literary scholars or scientists, as is also true of the people publishing in the Zeit culture section.
In light of the limited size of the Grenzboten corpus, we refrained from a document classification (into criticism vs. non-criticism) over such a long period (1841–1922), which would constitute a separate research project. In addition to the relatively small early corpus, a limitation of this data is the temporal gap between the two corpora from 1922, where Grenzboten ends, to 1946, where the Zeit corpus starts, precluding not only the analysis of the missing 1930s, but also that of the 1920s and 1940s, for which data are only partially available. To close this gap, we rely on a third corpus for the large-scale analysis, compiled from a journal of literary criticism: the Deutsche Vierteljahrsschrift für Literaturwissenschaft und Geistesgeschichte (1923–2009, DVjs), which contains 800k sentences over 2.58 million tokens. This journal played an important role in the history of the establishment of literary studies as an academic discipline. Besides these considerations, the choice of these three specific periodicals was made for pragmatic reasons, above all the availability of the corpora.
4. Manual annotation
Manual annotation is an essential element of the present study because it allows one to test whether distinctions made by literary scholars in the field of narratology can be successfully applied to word sense disambiguation. Furthermore, a manual annotation process should generate high-quality annotation data for training and evaluating machine learning models or conducting experiments more generally, ensuring consistency, accuracy, and reliability.
To ensure the quality of our data, we used gold-standard evaluation (discussion among the annotators) and calculated inter-annotator agreement. The iterative process of gold-standard evaluation helped us to repeatedly improve the annotation guidelines4 (Gaidys et al. 2017; Gius and Jacke 2017).
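For categorical labels of this kind, chance-corrected agreement between two annotators is commonly measured with Cohen's kappa (the article does not specify the coefficient used here, so this is one standard choice). A self-contained sketch with invented label sequences:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labeling with each
    # annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Invented annotations over the four sense categories.
a = ["author", "author", "homo", "hetero", "oral", "author"]
b = ["author", "homo", "homo", "hetero", "oral", "author"]
kappa = cohens_kappa(a, b)
```

A kappa of 1.0 indicates perfect agreement, 0.0 agreement at chance level; iterative guideline refinement, as described above, typically raises the coefficient between annotation rounds.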
To our knowledge there have been no prior attempts to operationalize word senses of the ‘narrator’ or ‘Erzähler’ concept. In pilot experiments, we tested the feasibility of a so-called semantic distance annotation (Erk, McCarthy, and Gaylord 2013; Brown 2008; Schlechtweg, Schulte im Walde, and Eckmann 2018) and also derived sense clusters from language models in an unsupervised manner (McCarthy, Apidianaki, and Erk 2016; Wiedemann et al. 2019; Yenicelik, Schmidt, and Kilcher 2020; Soler and Apidianaki 2021). However, we found that these approaches did not yield practicable categories. Instead, we opted for a theory-driven categorical annotation of word senses of the narrator concept.
The remainder of this section introduces the annotation categories (Section 4.1) and describes the annotation workflow, challenges during the annotation and the inter-annotator agreement (Section 4.2). Finally, we show the temporal distribution of the annotation categories (Section 4.3).
4.1 Categories
To develop the annotation categories, we used a top-down categorization approach: based on canonized German dictionaries (Wahrig-Burfeind 2008; Dudenredaktion 2020) and narratology reference works (Martínez 2011; Hühn et al. 2014; Hühn et al. 2022) we identified four basic meanings of ‘Erzähler’ (narrator):
oral narrator (a real person who tells or retells a story orally);
the author of a narrative text (the author of a written piece of narrative);
a fictive narrator who is not part of the narrated world (called ‘heterodiegetic narrator’ in narratology);
a fictive narrator who is part of the narrated world (called ‘homodiegetic narrator’).5
Although narratology provides many more possible distinctions, we decided to limit ourselves to these categories, because the intended annotation deals not with the narrative texts themselves but with texts about them. During the annotation, these categories have proven to be viable and exhaustive for reviews (except for metaphoric uses of the term6 and uses of the term in fictional narratives, which occasionally can be found in the Grenzboten corpus and were excluded). Table 1 illustrates the typology of ‘narrator’ word senses.
Table 1. Typology of ‘narrator’ word senses.

| Example | Translation | Category |
|---|---|---|
| W. Scherer macht in seiner „Geschichte der deutschen Dichtung des 11. und 12. Jahrhunderts” (S. 113) auf die Schilderung aufmerksam, die Ranke (Zur Geschichte der italienischen Poesie S. 19 f.), von dem Raccontatore in den Straßen von Venedig giebt: Auf der Riva Schiavone zu Venedig sieht man diesen Erzähler; alle Tage, wenn Feierabend gemacht wird, seine Zuhörer um sich sammeln. (Grenzboten 1878) | W. Scherer, in his “History of German Poetry of the 11th and 12th Centuries” (p. 113), draws attention to the account that Ranke (Zur Geschichte der italienischen Poesie p. 19 f.) gives of the raccontatore in the streets of Venice: On the Riva Schiavone in Venice one sees this narrator; every day, when work is finished, gathering his listeners around him. | A Oral Narrator |
| Wenn dennoch Heliodor selbst von einem Tasso als guter Erzähler gepriesen wird, wenn Racine diesen Roman in dem Grade liebgewann, daß er ihn auswendig lernte, als seine Lehrer ihm zwei Exemplare nacheinander confiscire und verbrannt hatten, so kann mit jenem Lobe nur gemeint sein, daß der Verfasser in planvoller Weise die Abenteuer seiner Helden zur Gesamthandlung verbunden, und daß er sie anziehend geschildert hat. (Grenzboten 1857) | If Heliodor himself is praised by a Tasso as a good narrator, if Racine loved this novel to the extent that he learned it by heart when his teachers had confiscated and burned two copies in a row, then this praise can only mean that the author has connected the adventures of his heroes in a planned way to the overall plot and that he has described them attractively. | B Author |
| Nicholson Bakers Buch ist die Wiedergabe eines langen nächtlichen Telephongesprächs, in das der Erzähler lediglich durch knapp gesetzte „sagte er“ oder „sagte sie“ hineinregiert. Ansonsten wörtliche Rede im Dienste wechselseitiger Erregung. Jim und Abby, er von der Westküste, sie aus dem Osten der USA, haben sich durch einen erotischen Kontaktservice kennengelernt und sich nach kurzem Abtasten in der Öffentlichkeit eines Telephonpools in ein „elektronisches Hinterzimmer“ zurückgezogen. (Zeit 1992) | Nicholson Baker’s book is a rendition of a long late-night telephone conversation, into which the narrator merely interferes with tersely placed “he said” or “she said”. Otherwise verbatim speech in the service of mutual excitement. Jim and Abby, he from the West Coast, she from the eastern U.S., have met through an erotic contact service and, after briefly feeling each other out in the public of a telephone pool, have retreated to an “electronic back room”. | C Heterodiegetic Narrator (fictive) |
| Es ist ja alles in einem: das Drama eines Wahnsinnigen, der Schiff und Mannschaft um einer fixen Idee willen in den Untergang treibt; der Bericht eines Matrosen (des Erzählers Ishmael) über seine Schicksalsgenossen, über den Schiffbruch und seine Errettung; schließlich die sachkundige Abhandlung über Wale und Walfang. (Zeit 2006) | It is all in one: the drama of a madman who drives his ship and crew to their doom for the sake of an obsession; the report of a sailor (the narrator Ishmael) about his companions in misfortune, about the shipwreck and his salvation; and finally the expert treatise on whales and whaling. | D Homodiegetic Narrator (fictive) |
Example . | Translation . | Category . |
---|---|---|
W. Scherer macht in seiner „Geschichte der deutschen Dichtung des 11. und 12. Jahrhunderts” (S. 113) auf die Schilderung aufmerksam, die Ranke (Zur Geschichte der italienischen Poesie S. 19 f.), von dem Raccontatore in den Straßen von Venedig giebt: Auf der Riva Schiavone zu Venedig sieht man diesen Erzähler; alle Tage, wenn Feierabend gemacht wird, seine Zuhörer um sich sammeln. (Grenzboten 1878) | W. Scherer, in his “History of German Poetry of the 11th and 12th Centuries” (p. 113), draws attention to the account that Ranke (Zur Geschichte der italienischen Poesie p. 19 f.) gives of the raccontatore in the streets of Venice: On the Riva Schiavone in Venice one sees this narrator; every day, when work is finished, gathering his listeners around him. | A Oral Narrator |
Wenn dennoch Heliodor selbst von einem Tasso als guter Erzähler gepriesen wird, wenn Racine diesen Roman in dem Grade liebgewann, daß er ihn auswendig lernte, als seine Lehrer ihm zwei Exemplare nacheinander confiscire und verbrannt hatten, so kann mit jenem Lobe nur gemeint sein, daß der Verfasser in planvoller Weise die Abenteuer seiner Helden zur Gesamthandlung verbunden, und daß er sie anziehend geschildert hat. (Grenzboten 1857) | If Heliodor himself is praised by a Tasso as a good narrator, if Racine loved this novel to the extent that he learned it by heart when his teachers had confiscated and burned two copies in a row, then this praise can only mean that the author has connected the adventures of his heroes in a planned way to the overall plot and that he has described them attractively. | B Author |
Nicholson Bakers Buch ist die Wiedergabe eines langen nächtlichen Telephongesprächs, in das der Erzähler lediglich durch knapp gesetzte „sagte er“ oder „sagte sie“ hineinregiert. Ansonsten wörtliche Rede im Dienste wechselseitiger Erregung. Jim und Abby, er von der Westküste, sie aus dem Osten der USA, haben sich durch einen erotischen Kontaktservice kennengelernt und sich nach kurzem Abtasten in der Öffentlichkeit eines Telephonpools in ein „elektronisches Hinterzimmer“ zurückgezogen. (Zeit 1992) | Nicholson Baker’s book is a rendition of a long late-night telephone conversation, into which the narrator merely interferes with tersely placed “he said” or “she said”. Otherwise verbatim speech in the service of mutual excitement. Jim and Abby, he from the West Coast, she from the eastern U.S., have met through an erotic contact service and, after briefly feeling each other out in the public of a telephone pool, have retreated to an ”electronic back room.” | C Heterodiegetic Narrator (fictive) |
Es ist ja alles in einem: das Drama eines Wahnsinnigen, der Schiff und Mannschaft um einer fixen Idee willen in den Untergang treibt; der Bericht eines Matrosen (des Erzählers Ishmael) über seine Schicksalsgenossen, über den Schiffbruch und seine Errettung; schließlich die sachkundige Abhandlung über Wale und Walfang. (Zeit 2006) | It is all in one: the drama of a madman who drives his ship and crew to their doom for the sake of an obsession; the report of a sailor (the narrator Ishmael) about his companions in misfortune, about the shipwreck and his salvation; and finally the expert treatise on whales and whaling. | D Homodiegetic Narrator (fictive) |
| Example | Translation | Category |
|---|---|---|
| W. Scherer macht in seiner „Geschichte der deutschen Dichtung des 11. und 12. Jahrhunderts” (S. 113) auf die Schilderung aufmerksam, die Ranke (Zur Geschichte der italienischen Poesie S. 19 f.), von dem Raccontatore in den Straßen von Venedig giebt: Auf der Riva Schiavone zu Venedig sieht man diesen Erzähler; alle Tage, wenn Feierabend gemacht wird, seine Zuhörer um sich sammeln. (Grenzboten 1878) | W. Scherer, in his "History of German Poetry of the 11th and 12th Centuries" (p. 113), draws attention to the account that Ranke (Zur Geschichte der italienischen Poesie, p. 19 f.) gives of the raccontatore in the streets of Venice: On the Riva Schiavone in Venice one sees this narrator; every day, when work is finished, gathering his listeners around him. | A Oral Narrator |
| Wenn dennoch Heliodor selbst von einem Tasso als guter Erzähler gepriesen wird, wenn Racine diesen Roman in dem Grade liebgewann, daß er ihn auswendig lernte, als seine Lehrer ihm zwei Exemplare nacheinander confiscire und verbrannt hatten, so kann mit jenem Lobe nur gemeint sein, daß der Verfasser in planvoller Weise die Abenteuer seiner Helden zur Gesamthandlung verbunden, und daß er sie anziehend geschildert hat. (Grenzboten 1857) | If Heliodor himself is praised by a Tasso as a good narrator, if Racine loved this novel to the extent that he learned it by heart when his teachers had confiscated and burned two copies one after the other, then this praise can only mean that the author has connected the adventures of his heroes in a planned way to the overall plot and that he has described them attractively. | B Author |
| Nicholson Bakers Buch ist die Wiedergabe eines langen nächtlichen Telephongesprächs, in das der Erzähler lediglich durch knapp gesetzte „sagte er“ oder „sagte sie“ hineinregiert. Ansonsten wörtliche Rede im Dienste wechselseitiger Erregung. Jim und Abby, er von der Westküste, sie aus dem Osten der USA, haben sich durch einen erotischen Kontaktservice kennengelernt und sich nach kurzem Abtasten in der Öffentlichkeit eines Telephonpools in ein „elektronisches Hinterzimmer“ zurückgezogen. (Zeit 1992) | Nicholson Baker’s book is a rendition of a long late-night telephone conversation, into which the narrator merely interferes with tersely placed “he said” or “she said”. Otherwise verbatim speech in the service of mutual excitement. Jim and Abby, he from the West Coast, she from the eastern U.S., have met through an erotic contact service and, after briefly feeling each other out in the public of a telephone pool, have retreated to an "electronic back room." | C Heterodiegetic Narrator (fictive) |
| Es ist ja alles in einem: das Drama eines Wahnsinnigen, der Schiff und Mannschaft um einer fixen Idee willen in den Untergang treibt; der Bericht eines Matrosen (des Erzählers Ishmael) über seine Schicksalsgenossen, über den Schiffbruch und seine Errettung; schließlich die sachkundige Abhandlung über Wale und Walfang. (Zeit 2006) | It is all in one: the drama of a madman who drives his ship and crew to their doom for the sake of an obsession; the report of a sailor (the narrator Ishmael) about his companions in misfortune, about the shipwreck and his salvation; and finally the expert treatise on whales and whaling. | D Homodiegetic Narrator (fictive) |
4.2 Workflow and evaluation
For manual annotation, we extracted all instances of the target concept ‘Erzähler’ (narrator), encompassing various morphological/surface forms, including genitive forms (‘Erzählers’), feminine variants (‘Erzählerin’), and plurals (‘Erzählern’, ‘Erzählerinnen’). We retrieved all sentences from both the Grenzboten and the Zeit corpus in which the target occurs (the ‘target sentence’) and two sentences before and two after the target sentence. In the end, we annotated all text snippets from the Grenzboten corpus and drew a random sample from the Zeit corpus. DVjs was not part of the manual annotation cycle.
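The extraction step can be sketched as follows; the regex, the helper name, and the simple list-of-sentences handling are our own illustrative assumptions, not the authors' code:

```python
import re

# Match the surface forms of 'Erzähler' (genitive 'Erzählers', dative
# plural 'Erzählern', feminine 'Erzählerin', plural 'Erzählerinnen')
# and keep a two-sentence context window on each side of a hit.
TARGET = re.compile(r"\bErzähler(?:s|n|in|innen)?\b")

def extract_snippets(sentences, window=2):
    """Return (snippet, index_of_target_sentence) for every match."""
    snippets = []
    for i, sent in enumerate(sentences):
        if TARGET.search(sent):
            lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
            snippets.append((" ".join(sentences[lo:hi]), i))
    return snippets

sents = [
    "Der Roman beginnt langsam.",
    "Der Erzähler berichtet von seiner Kindheit.",
    "Dann wechselt die Perspektive.",
]
print(extract_snippets(sents))
```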
We used a double-blind annotation process to annotate the various word senses associated with the narrator concept; additionally, annotators noted a confidence score per label (0–100) to indicate how certain they were about each annotation. Overall, there were three annotators with a background in literary studies (minimum bachelor's degree), where Annotator 1 (A1) annotated everything, and Annotators 2 and 3 (A2, A3) split the data among them. We annotated in batches (of 100 or 200 text snippets) and after each batch (cycle) we calculated agreement among the annotators and discussed problematic instances (where there was no agreement). These discussions resulted in a gold-standard which represents the consensus of the annotators after discussion (see the data repository: Haider and Gittel 2025). In ambiguous cases where the discussion between the annotators was not conclusive (ca. 10% of all data points), a multi-label annotation was accepted (e.g. an instance can be both A and B, with the first label slightly more probable, labeled 'A/B'). In order to calculate inter-annotator agreement, and for further statistics and computational experiments, all secondary labels (from the multi-label annotation of the ambiguous cases) were removed. We also removed instances from a fictional context (e.g. from a fictional narrative as contained in the Grenzboten corpus or from a quote of a fictional text).
Overall, we ended up with 885 annotated instances in the gold-standard, in which the ‘author’ sense (B) is surprisingly frequent, while the ‘heterodiegetic narrator’ sense (C) is infrequent. We provide inter-annotator agreement in accuracy and Cohen’s Kappa (see Table 2).7
| Word sense | Category | Instances | Accuracy (A1, A2) | Accuracy (A1, A3) |
|---|---|---|---|---|
| Oral Narrator | A | 64 | 0.53 | 0.41 |
| Author | B | 590 | 0.84 | 0.75 |
| Heterodiegetic Narrator (fictive) | C | 77 | 0.42 | 0.33 |
| Homodiegetic (character) Narrator (fictive) | D | 154 | 0.62 | 0.76 |
| Total (micro Accuracy) | ALL | 885 | 0.85 | 0.80 |
| Total (micro Kappa) | ALL | | 0.80 | 0.74 |
As can be seen from Table 2, some word senses are easier to annotate than others. Annotators are relatively confident for labels B (author) and D (homodiegetic narrator), while annotating the word senses A (oral narrator) and C (heterodiegetic narrator) is considerably harder. This is reflected in the accuracy numbers for each individual label. Cohen’s Kappa (κ, Cohen 1960) over all labels was substantial (κ(A1, A2) = 0.80, κ(A1, A3) = 0.74).
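For readers unfamiliar with the measure, Cohen's Kappa corrects raw agreement for the agreement expected by chance given each annotator's label frequencies. A minimal sketch with invented toy labels (not the actual annotation data):

```python
from collections import Counter

def cohens_kappa(labels1, labels2):
    """Cohen's kappa for two annotators over categorical labels."""
    assert len(labels1) == len(labels2)
    n = len(labels1)
    observed = sum(a == b for a, b in zip(labels1, labels2)) / n
    c1, c2 = Counter(labels1), Counter(labels2)
    # Chance agreement from the marginal label frequencies
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n**2
    return (observed - expected) / (1 - expected)

a1 = ["B", "B", "D", "A", "B", "C"]
a2 = ["B", "B", "D", "B", "B", "C"]
print(round(cohens_kappa(a1, a2), 3))  # observed 5/6, corrected downwards
```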
During the categorical annotation and our error analysis we observed several things:
Particularly for label C, we noticed that annotators would regularly use an ‘exclusion principle’, such that C was typically annotated when it was clear that the instance would not belong to any other word senses.
Proper names, especially family names, that corefer with instances of ‘Erzähler’ are a good indicator for the category B (author of a narrative), e.g. ‘Barbara Honigmann ist eine große Erzählerin’ (‘Barbara Honigmann is a great storyteller’, Zeit 2000).
The distinction between B (author of a narrative) and D (homodiegetic narrator) is sometimes context-dependent in the sense that annotators have to understand whether the literary critic wrote about a work of fiction or non-fiction. The following example may help to illustrate the point:
For the most part, an eerie coziness dominates them. And an infinite abandonment. In the first story, ‘Arrival in Africa’, the narrator, then 12 years old, played furtive games with the boy next door, Franzi, at home in Linz, which were followed at some point by a sudden estrangement—another boy, whose name is probably not Franky […].8 (Zeit 1990, our transl.)
In this example, one may assume that the literary critic is dealing with an autobiographical text, in which case ‘Erzählerin’ (the feminine form of ‘narrator’) would mean the author (category B). However, one may also assume that the review is about a fictional text, in which case ‘Erzählerin’ would denote a fictive narrator who is part of the narrated world (category D). Note that, although it is impossible to decide rationally between B and D on the basis of this text snippet alone (which we selected for illustrative purposes), many other instances of ‘Erzähler’ clearly mean B or D, for example because genre terms like ‘novel’ indicate the fiction status of the reviewed book or because there are obvious co-reference chains.
4.3 Diachronic analysis
Based on the gold-standard annotation, this section presents an overview of the data across time. On the one hand, this serves to understand the annotation data and, on the other hand, to gain initial insights regarding our hypotheses. Initially, we visualize the number of extracted and annotated ‘Erzähler’-instances over time (see Fig. 1).

Absolute distribution of instances and labels in gold-standard over time.
As evident from Fig. 1, the ‘Erzähler’ instances are unevenly distributed over time, and there is a notable increase of ‘Erzähler’ instances in the last third of the 19th century. Remarkably, this increase is not due to a varying time coverage of the Grenzboten corpus or an increasing number of reviews per decade. Since our study is semasiologically (not onomasiologically) oriented, we refrain from hypothesizing about causes for this interesting finding. The imbalance in the second half of the 20th century reflects the actual distribution of Zeit corpus data across decades due to random sampling. Please recall that our data is sparse in the 1920s and 1950s and that we do not have any data for the 1930s and 1940s.
To get a better understanding of the distribution of the annotation categories and a first idea concerning hypotheses H1 and H2, Fig. 2 shows the relative distribution of labels for each decade.

Relative distribution of primary labels per decade in gold-standard.
In Fig. 2, one may observe several things: (1) Most generally, the heterodiegetic narrator word sense (category C) and the homodiegetic narrator word sense (category D) become relatively more frequent over time, especially in the 20th century, which supports hypothesis H1. At the same time, the ‘oral narrator’ sense and the ‘author’ sense become less frequent, which speaks in favor of hypothesis H2. (2) The heterodiegetic narrator sense (C) is already present, but very rare, at the end of the 19th century and the beginning of the 20th century; it only became more frequent after 1950, reaching a proportion of more than 25 per cent after 2000. These data suggest that the ‘heterodiegetic narrator’ sense (C) was established only in the course of the 20th century. (3) The homodiegetic narrator sense (D) already has a significant share in the 19th century, but becomes more common only in the same period in which the heterodiegetic narrator sense is established. (4) The strong growth of the ‘heterodiegetic narrator’ sense (C) and the ‘homodiegetic narrator’ sense (D) after 1950 occurs at the expense of the ‘author’ sense. In general, these observations clearly speak in favor of the main hypothesis H that encompasses H1 and H2.9
5. Computational modeling
In this section, we document the tuning of different BERT models to recognize the different word senses of ‘Erzähler’ (narrator) for a subsequent diachronic large-scale analysis. We are particularly interested in the extent to which contextualized word embedding models learn to disambiguate the word senses and encode the dimension of fictivity that separates the categories A and B on the one hand from C and D on the other hand. Thus, we formulated two sub-tasks, (1) supervised text classification of text snippets, and (2) unsupervised visualization of embedding variation of ‘Erzähler’ tokens, utilizing a transfer learning setup.
The next subsection presents an evaluation of text classification of text snippets containing ‘Erzähler’-tokens with 10-fold cross-validation (Section 5.1). The following section will present the visualization of embedding variation of ‘Erzähler’ tokens in selected models (Section 5.2). The last section employs the ensemble of the models from the cross-validation for a large-scale analysis to shed light on the diachronic sense change of the ‘Erzähler’ concept in literary criticism (Section 5.3).
5.1 Classification
To assess the ability of BERT language models (Devlin et al. 2019) to learn the categories via supervised text classification, we utilized the gold-standard annotation of text snippets (target sentence with ‘Erzähler’ token plus two sentences before and after) and the assigned primary label of narrator meaning. In particular, two different setups were explored, a Fine-grained Setup which distinguishes between the categories A, B, C, D, and a Coarse-grained Setup, which distinguishes Non-fictive (A, B) vs. Fictive (C, D).
5.1.1 Fine-grained models (A, B, C, D)
Using a multi-class setup (where each text is assigned exactly one label), the models were tuned with a text classification head for BERT (via the CLS token). We performed a ten-fold cross-validation in which the dataset was randomly split into 80/10/10 training/dev/test sets. We used the same fold datasets for all settings to ensure comparability. Each model was tuned for 10 epochs.
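The fold construction described above can be sketched as follows; note that the paper specifies repeated random 80/10/10 splits, and the seed and function name are our own assumptions:

```python
import random

def make_folds(items, n_folds=10, seed=42):
    """Repeated random 80/10/10 train/dev/test splits (one per fold)."""
    rng = random.Random(seed)
    data = list(items)
    folds = []
    for _ in range(n_folds):
        rng.shuffle(data)
        n = len(data)
        train = data[: int(0.8 * n)]
        dev = data[int(0.8 * n): int(0.9 * n)]
        test = data[int(0.9 * n):]
        folds.append((train, dev, test))
    return folds

# 885 gold-standard instances, as in Table 2
folds = make_folds(range(885))
print(len(folds), len(folds[0][0]), len(folds[0][1]), len(folds[0][2]))
```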
Since we are dealing with a low-resource setting, we tested different setups to improve classification accuracy:
Vanilla (Naive)
This is the most basic setup, where we used a vanilla BERT model and the original label distribution during training and testing.
Downsampling of B in Training Set
In this case, we downsampled the B instances in the training set to a maximum of 300 instances to mitigate the overrepresentation of B in the training data. The test set remained unchanged.
Transfer Learning
In this setup, we tuned the BERT model on a different dataset than our gold-standard and then employed the resulting language model for our task:
Transfer Redew
This model originates from the Redewiedergabe-project (Brunner et al. 2020), where a German BERT model was tuned with next word prediction on historical German, including the Grenzboten dataset.10
Transfer Zeit Dec
We used a supervised tuning where the task was to classify the 8000 ‘Erzähler’ text snippets from the Zeit dataset regarding the decade to which they belong.
Transfer Zeit Next Word
We used an unsupervised tuning where the 8000 Erzähler text snippets from the Zeit dataset are used for next word prediction.
As can be seen in Table 3, in all setups the overrepresented class B is learned better than the other classes. Especially for the underrepresented classes A and C, one observes suboptimal performance. The downsampling and transfer learning setups yield no significant performance gains. The ‘Transfer Zeit Next Word’ models marginally improve the averaged F1-scores for B and D over the vanilla models, but they also show higher variability, as evident from the standard deviation. We therefore focus on the vanilla models in the remainder of the paper.
| Class | Naive F1 (SD) | Downsampling F1 (SD) | Transfer Redew F1 (SD) | Transfer Zeit Dec F1 (SD) | Transfer Zeit Next Word F1 (SD) |
|---|---|---|---|---|---|
| A | 0.318 (±0.267) | 0.413 (±0.233) | 0.311 (±0.251) | 0.302 (±0.225) | 0.332 (±0.243) |
| B | 0.842 (±0.052) | 0.828 (±0.054) | 0.825 (±0.046) | 0.839 (±0.045) | 0.853 (±0.046) |
| C | 0.396 (±0.142) | 0.348 (±0.118) | 0.352 (±0.168) | 0.380 (±0.179) | 0.380 (±0.170) |
| D | 0.557 (±0.063) | 0.543 (±0.118) | 0.515 (±0.075) | 0.611 (±0.081) | 0.580 (±0.117) |
| Total (macro) | 0.528 (±0.085) | 0.533 (±0.062) | 0.501 (±0.076) | 0.533 (±0.065) | 0.537 (±0.057) |
| Total (weighted) | 0.716 (±0.058) | 0.705 (±0.055) | 0.688 (±0.059) | 0.721 (±0.054) | 0.724 (±0.057) |
Figure 3 shows the cumulative confusion matrix, where all vanilla models from the cross-validation are used to predict their respective test sets, aggregating all gold and predicted labels. Although class A shows low recall, it is almost exclusively confused with class B and thus does not pose a problem for the detection of fictivity.

Cumulative confusion matrix of fine-grained vanilla models aggregated over all fold test sets.
Classes C and D (the fictivity classes) are not learned as well as class B, but still perform well enough to be used for large-scale prediction. Overall, the main problem is that C and D are frequently misclassified as B (and vice versa). The confusion between B and D can be explained by drawing on observations made during annotation: the same sentence can often have two interpretations depending on the presumed classification of the book under review. For example, in the sentence ‘the narrator describes his holidays in Russia’, ‘narrator’ is usually disambiguated as ‘writer of the narrative’ if one knows or suspects that the book under review is non-fiction, e.g. an autobiography. However, if one suspects that the book under review is fiction, one disambiguates ‘narrator’ as a ‘homodiegetic narrator’. This kind of inferential dependency seems hard for the model to grasp, which is hardly surprising given that it is non-trivial even for humans to infer the fiction status of the book under review if the text snippet does not contain genre terms like ‘novel’ or ‘report’ that indicate a certain fiction status.
Finally, to investigate not only the proposed senses, but also whether models can learn a dimension of fictivity, we run experiments where the classes are conflated in their super-classes fictive vs. non-fictive.
5.1.2 Coarse-grained model: fictive vs. non-fictive
Table 4 shows the results for the classification into fictive vs. non-fictive ‘Erzähler’ instances. In this setup, the classes are aggregated pairwise, such that A and B form the non-fictive class, and C and D form the fictive class. To make the results comparable to the previous experiments (Table 3), we perform ten-fold cross-validation on the same folded data splits as before, only with aggregated classes.
Text classification performance of coarse-grained models: fictive vs. non-fictive.
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Fictive | 0.747 (±0.105) | 0.652 (±0.153) | 0.683 (±0.096) |
| Non-Fictive | 0.857 (±0.069) | 0.905 (±0.048) | 0.878 (±0.037) |
| Macro Avg | 0.802 | 0.779 | 0.781 |
Similar to the manual annotation, it is more challenging to identify text snippets with fictive ‘Erzähler’ senses than those with non-fictive senses. This is partly attributable to the high number of non-fictive examples in the training set. Another part of the explanation may be that it is not trivial to distinguish B (non-fictive) from D (fictive) (see Fig. 3), as also observed during the manual annotation and in the experiment above. Overall, we consider this model adequate for large-scale annotation.
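The aggregation into super-classes is a simple relabeling; the following sketch makes it explicit, using the instance counts from Table 2:

```python
# Collapse the fine-grained senses into the coarse super-classes:
# A (oral) and B (author) are non-fictive, C and D are fictive.
COARSE = {"A": "non-fictive", "B": "non-fictive", "C": "fictive", "D": "fictive"}
COUNTS = {"A": 64, "B": 590, "C": 77, "D": 154}  # gold-standard instances

totals = {}
for label, n in COUNTS.items():
    key = COARSE[label]
    totals[key] = totals.get(key, 0) + n
print(totals)  # {'non-fictive': 654, 'fictive': 231}
```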
5.2 Word sense disambiguation and domain dependence
This section sheds light on what specific models, including those from the Coarse-Grained and the Fine-Grained Setups, have learned about the word senses of the target concept and about fictivity. So far, we have formulated the task of distinguishing ‘Erzähler’ word senses as text classification, using entire text snippets to make a prediction. In addition, we are interested in the word senses of the target concept ‘Erzähler’ as they are understood within their context, and in the extent to which the ‘Erzähler’ token encodes a dimension of fictivity. This is operationalized by visualizing the variation of token embeddings of the narrator concept.
In order to evaluate how well different models can represent the narrator concept, we compare four different models and visualize their representations of the target concept. Ideally, pre-trained transformer language models should be able to distinguish word senses out of the box. However, this has proven to be problematic concerning domain-specific language as in our use case.
We utilized a transfer learning task that combines the benefits of fine-tuning through text classification with subsequent token embedding extraction from a given model. This method is attractive for two reasons: first, extracting token embeddings after fine-tuning allows one to gauge the model’s ability to encode specific words, such as ‘Erzähler’, into vector representations. These embeddings provide a condensed yet rich representation of the word’s senses within the trained context. Second, the method serves to uncover the ability of pre-trained and tuned models to encode specific word senses in literary criticism, and the dimension of fictivity in particular.
We compared the following models:
Untuned BERT-base
BERT-base-historical-redew: Redewiedergabe (BERT-base, unsupervised tuning on, i.a., the Grenzboten corpus)
BERT-base tuned on Fine-Grained (ABCD) text classification (Fold 1 Model)
BERT-base tuned on Coarse-Grained (Fictive vs. Non-Fictive) text classification (Fold 1 Model)
Model 1 is a generic German BERT base that has never ‘seen’ literary language or literary criticism, but it is the basis for all following models. Model 2 is based on Model 1, but was tuned with (unsupervised) next token prediction on literary language, mainly narrative text, but also including the target corpus Die Grenzboten. Model 3 is the Fine-grained Model from Fold 1 (see Table 3) that was tuned (supervised) on the gold-standard with an ABCD multi-class objective. Model 4 is the Coarse-grained Model from Fold 1 that was tuned (supervised) with the objective to distinguish the fictive from the non-fictive ‘Erzähler’ senses (see Table 4). Compared to Model 1, we expect Models 2, 3, and 4 to be more capable of encoding a dimension of fictivity, as they have seen several instances of ‘Erzähler’ during tuning, while Models 3 and 4 should have learned a proper representation of fictivity.
Concretely, we proceeded as follows: we extracted the last-layer token embedding of ‘Erzähler’ from each of the four (frozen) models under scrutiny, with the entire context as input, and visualized the resulting vectors with t-SNE (t-distributed stochastic neighbor embedding). Since each vector from BERT has 768 dimensions, we reduced the vector space to two dimensions, where t-SNE can be expected to preserve the distances between individual instances (similar points in the original space remain close together in the reduced space).
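The dimensionality reduction step can be illustrated as follows, here with random stand-in vectors instead of the actual extracted BERT embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

# 50 hypothetical 768-dimensional 'Erzähler' token embeddings
# (random stand-ins for the vectors extracted from the frozen models).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 768))

# Reduce to 2-D for plotting; perplexity must be smaller than the
# number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
points = tsne.fit_transform(embeddings)
print(points.shape)
```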
Figure 4 illustrates that, unsurprisingly, the BERT-base model does not properly encode ‘Erzähler’ word senses: the identified sub-clusters do not line up with the word sense variants we identified. Second, the Redewiedergabe model (which was tuned, inter alia, on the Grenzboten corpus with next-token prediction) tends to push the fictive instances (C, D) to the left, apart from an idiosyncratic blob on the opposite side. This might indicate that this model learned some sort of fictivity dimension. Third, the models that were tuned on our annotation data (bottom) distinguish the annotation categories more clearly. The Fine-grained ABCD model (bottom left) learned to distinguish C and D from B. The Coarse-grained model, which was trained only on Fictive vs. Non-Fictive, properly distinguishes these dimensions and also shows clear cluster hubs separating the two classes.

t-SNE Visualization of ‘Erzähler’-Embeddings. The model names refer to the type of tuning: BERT-base (top left); Redewiedergabe (top right); fine-grained ABCD-model (bottom left); coarse-grained fictive vs. non-fictive model (bottom right).
To sum up, we may conclude that both the performance of our models (see Section 5.1) and the insights we gained from the visualization of the token embeddings suggest that the Fine-Grained and the Coarse-Grained Models can be reasonably used for large-scale prediction.
5.3 Automated large-scale annotation
In this section, we employ the ensemble of our tuned text classification models (see the vanilla models in Table 3) to predict the meaning variation over time and to see whether the meaning change we identified in the manual annotation generalizes to larger datasets. The large-scale analysis is carried out with the main objective of tracking semantic change over time and the secondary objective of comparing different reader groups (scholars vs. non-scholars). We first conducted an analysis of text snippets containing the narrator token in the Zeit corpus, second a sanity check of this analysis in which we filter out reviews of non-fiction, and third a comparison between non-scholarly and scholarly use of the narrator term using the DVjs corpus.
5.3.1 Semantic change over time (Zeit corpus)
In this subsection, we employ the models to annotate the rest of the Zeit corpus to analyze the semantic change of ‘Erzähler’ (narrator) over time. The unannotated Zeit dataset contains around 8,000 text samples (containing ‘Erzähler’), where data from the 1940s is underrepresented. All ten cross-validation vanilla models were used for large-scale prediction in the following way: in a first step, for every decade and model, the relative distribution of labels was computed. In a second step, we calculated mean and standard deviation for every decade and label given all ten models. We thus estimate the variation of the predictions of the models. See Fig. 5 (Fine-Grained Setup) and Fig. 6 (Coarse-Grained Setup).
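The two-step aggregation (per-decade label distribution per model, then mean and standard deviation over the ten models) can be sketched with invented numbers:

```python
import numpy as np

# Relative frequency of label D in one decade as predicted by each of
# the ten cross-validation models (values invented for illustration).
decade = "1990s"
per_model = np.array([0.21, 0.24, 0.19, 0.22, 0.25,
                      0.20, 0.23, 0.22, 0.21, 0.24])

# Mean and (sample) standard deviation over the ten models estimate
# the central tendency and the variation of the ensemble's predictions.
mean, sd = per_model.mean(), per_model.std(ddof=1)
print(f"{decade}: D = {mean:.3f} ± {sd:.3f}")
```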

Relative distribution of word senses for target concept ‘Erzähler’ (narrator) over time via large-scale automatic annotation (Fine-grained Setup), Zeit corpus.

Relative distribution of word senses for target concept ‘Erzähler’ (narrator) over time via large-scale automatic annotation (Coarse-grained Setup), Zeit corpus.
Akin to the visualization of the manually annotated gold-standard (see Fig. 2), the ‘author’ sense of ‘Erzähler’ (B) loses significance and the ‘homodiegetic narrator’ sense (D) is increasingly used (see Fig. 5). The ‘heterodiegetic narrator’ sense (C) also increases, albeit more subtly. Figure 6 solidifies this observation: the robust Coarse-grained model shows a clear shift from the (non-fictive) ‘author’ sense towards a preferred use of the fictive narrator. Interestingly, the semantic development seems to be still ongoing, with the fictive and the non-fictive word senses on a par in the 2010s.
To examine the significance of the trends evident from Fig. 5, we performed a trend analysis with Spearman’s r by correlating the mean relative frequency of a class with the decade and calculated significance with a two-tailed t-test (whether the trend is significantly increasing/decreasing). We found an upward trend for categories C (r = +0.93, P < 0.001) and D (r = +1, P < 0.001), and a downward trend for category B (r = −1, P < 0.001). Category A shows no significant change.
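The trend statistic can be illustrated with a minimal stdlib implementation (the frequency values below are ours, for illustration only; in practice one would use, e.g., scipy.stats.spearmanr, which also returns the two-tailed p-value):

```python
def ranks(values):
    """Ranks 1..n for distinct values (ties would require average ranks)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for pos, idx in enumerate(order, start=1):
        out[idx] = pos
    return out

def spearman_rho(x, y):
    """Spearman's rank correlation via the classic d-squared formula."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-decade mean relative frequencies for class B ('author'):
decades = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
freq_b = [0.80, 0.72, 0.75, 0.60, 0.55, 0.50, 0.45]
rho = spearman_rho(decades, freq_b)  # close to -1: a clear downward trend
```

A correlation near −1 means the class loses ground almost monotonically across decades, even if individual decades deviate from the trend.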
5.3.2 Sanity check: what about reviews of non-fiction? (filtered Zeit corpus)
There is one possible objection against our findings:11 The author-narrator distinction is only sensible in regard to fictional texts. It is superfluous in the case of factual (or non-fictive) texts. So if our corpus contained a high number of reviews of non-fiction (e.g. autobiographical or other factual narratives), this would explain the high frequency of category B (‘author’) without saying much about the emergence of the author–narrator distinction in discussions of fiction.
In order to address this problem, we manually annotated 100 sampled reviews from the Zeit corpus with the categories ‘review of fiction’ vs. ‘review of non-fiction’. The ratio was 75:25, such that labeling every text as fictional already yields an accuracy of 75 per cent, a hard baseline for a classifier to beat. We trained several classifiers (Logistic Regression, SVM, Random Forest), which were on par with the baseline and mostly ignored the non-fictional class (F1 < 0.2). A BERT model tuned on 80 per cent of the annotated texts and optimized for precision on the fiction reviews (softmax decision threshold for ‘fictional’ at 70 per cent instead of 50 per cent) gave us a macro F1 of 84 per cent (and an accuracy of 90 per cent). Optimizing for precision biases the classifier towards decisions it is certain about, at the cost of recall: it might not find all relevant instances.
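The raised decision threshold can be sketched as follows (a minimal sketch with hypothetical names, not the authors' implementation): a review is labeled as fiction only when the softmax probability of the fiction class reaches 0.7, so borderline cases fall back to non-fiction.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw model scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_review(logits, fiction_idx=0, threshold=0.7):
    """Label a review 'fiction' only if the model is at least 70% confident,
    biasing the decision towards precision on the fiction class."""
    probs = softmax(logits)
    return "fiction" if probs[fiction_idx] >= threshold else "non-fiction"
```

With logits of (0.5, 0.0), plain argmax would choose fiction, but the softmax probability (ca. 0.62) stays below the threshold, so the review is treated as non-fiction.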
Using this classifier, we filtered the automatically annotated Zeit corpus to keep reviews of fiction only (excluding ca. 800 texts, amounting to ca. 10 per cent of reviews of non-fiction) and obtained the results in Fig. 7. Comparing Fig. 7 to Fig. 5, the trends we observed before become more pronounced: class B loses prominence while classes C and D gain prominence (cf. B and D being on a par in the 2010s).

Relative distribution of word senses for target concept ‘Erzähler’ (narrator) over time via large-scale automatic annotation (Fine-grained Setup), after filtering out reviews of non-fiction with BERT classifier, Zeit corpus.
Certainly, the classifier used is not perfect and can only indicate the tendency of the effect that reviews of non-fiction have on the frequency analysis. Given this clear tendency, one can assess a hypothetical scenario with a perfect classifier. Let us make two hypothetical assumptions: (1) a perfect correlation between non-fiction and class B (all narrator instances mentioned in reviews of non-fiction are classified as class B), and (2) a perfect classifier that detects all these instances. Then, if we filtered out all instances of non-fiction, the semantic change would be even more pronounced, and B even less prominent (ca. 30 per cent in the 2010s).
5.3.3 Comparing scholarly to non-scholarly readers (DVjs vs. Zeit corpus)
In a last step, we applied the ensemble of our tuned text classification models (see Vanilla models in Table 3) to the DVjs corpus (cf. Fig. 8) in exactly the same way as described above for the Zeit corpus (cf. Figs 5 and 7).

Relative distribution of word senses for target concept ‘Erzähler’ (narrator) over time via large-scale automatic annotation (Fine-grained Setup), DVjs corpus.
Since the DVjs is a literary studies journal that includes theoretically oriented work, Fig. 8 provides complementary data to the Zeit corpus (unfiltered for reviews of non-fiction, see Fig. 5). In a nutshell, each of these corpora represents a different reader group: literary scholars (DVjs) vs. literary critics in the narrower sense of ‘Literaturkritiker’ (Zeit). It is important to note, however, that both groups can be regarded as professional readers, in contrast to lay readers. Four things can be observed:
In the DVjs corpus, too, there is a (mild) semantic shift from the actual author of a narrative to a fictive instance that the reader of fiction has to imagine according to literary conventions, supporting hypothesis H. However, this change is in general less pronounced than in the Zeit corpus, given that the reference to the author is still by far the dominant use of ‘narrator’. This may be due to the fact that only a relatively small portion of articles in the DVjs deals with fictional narratives or is concerned with narratological questions, where one would expect the fictive narrator word senses to appear.
Furthermore, this mild semantic shift occurs suddenly during the 1950s. The proportion of the ‘heterodiegetic narrator’ sense (C) rises from basically 0 to over 10 per cent (11.4 per cent) and the proportion of the ‘homodiegetic narrator’ sense (D) also increases sharply (to 8.3 per cent). One possible explanation is that this change is theory-driven, since the theoretical discussion about the author–narrator distinction in the German-speaking world (featuring, among others, Käte Hamburger, Wolfgang Kayser, Eberhard Lämmert, and Franz Stanzel) intensified during this period and even took place in part in the DVjs (Cornils and Schernus 2003: 151–4). Literary scholars presumably follow these debates actively and are more directly influenced by them than literary critics in the narrower sense (Zeit corpus) are.
The maximum proportion of ‘heterodiegetic narrator’ sense (C) is higher than in the Zeit corpus and reaches its peak as early as the 1970s (15.3 per cent), in contrast to the Zeit corpus where the peak is only reached in the 2000s (8.7 per cent). This could indicate that the semantic change is driven by the discourses in literary studies, which—with a time lag—are reflected in literary critics’ language use via institutional educational processes (textbook production, studies, school education).
The proportion of the ‘homodiegetic narrator’ sense (D) is much lower in the DVjs corpus than in the Zeit corpus and stays more or less stable at around 10 per cent after the 1960s, while it increases gradually in the Zeit corpus. A possible explanation may be that literary critics in the Zeit deal more frequently with contemporary literature featuring a first-person narrator than literary scholars in the DVjs, who usually deal with canonized works of literature.
6. Summary and future work
The present article introduced a quantitative semantic approach to study the history of the author–narrator distinction in literary criticism. Following a semasiological approach, we studied the semantic change of the term ‘narrator’ (Erzähler), more specifically the gradual endorsement of a fictive word sense of the term, which we regard as a semantic prerequisite for an essential part of the modern practice of fiction: reading texts as if they were told by a fictive narrator. We discerned four basic meanings of the narrator concept: an oral narrator, the author of a narrative, a fictive narrator who is not part of the narrated world (heterodiegetic narrator), and a fictive homodiegetic narrator. Using data from historical periodicals (1841–2018), we manually annotated these word senses of ‘Erzähler’ in literary criticism and generated a gold-standard annotation. In a second step, we used the gold-standard data for supervised text classification of text snippets containing ‘Erzähler’ instances. In a third step, we used the cross-validated models for a large-scale analysis.
Our findings consistently support the main hypothesis H that the term gradually shifted its meaning from the actual author of a narrative to a fictive instance that the reader of fiction has to imagine according to literary conventions. While we found ‘Erzähler’ instances with the word sense ‘homodiegetic narrator’ already in the middle of the 19th century and instances with the word sense ‘heterodiegetic narrator’ already in the last third of the 19th century, they only become much more frequent from the 1970s onwards, with the ‘author’ sense still being a common meaning. Notably, our data suggest that we are in the middle of an ongoing meaning shift, in contrast to a word like ‘gay’, which almost completely shifted its meaning from ‘happy’ to ‘homosexual’ (see Hamilton, Leskovec, and Jurafsky 2016a). Since the term narrator is inherently bound to the verb ‘to narrate’ (erzählen), it remains to be seen whether a total displacement of the original word sense will take place or whether several word senses will continue to coexist in the future.
Future work could shed light on the causal relationship between theoretical debates in literary studies on the one hand and literary critics’ use of the narrator concept on the other. Our analysis using the DVjs corpus seems to suggest that the semantic change of the narrator concept is (partly) theory-driven. However, there are confounding factors (which works with a particular narrative perspective are published, which are selected and discussed by literary critics, the intensity of narratological debate, generational effects due to differing literary education) that are not easy to disentangle. Other routes worth pursuing include studying the use of relevant author/narrator concepts in languages other than German. Although our limited zero-shot learning experiments did not yield competitive results, generative LLMs with in-context learning show great promise, especially across languages.
Acknowledgements
In addition to the funding agency, we cordially thank our research assistants Jan Philipp Lau, Friederike Altmann and Rebecca Rist. We also thank the Max Planck Institute for Empirical Aesthetics in Frankfurt am Main for providing computing resources on their high performance cluster.
Author contributions
Benjamin Gittel (Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Project administration, Resources) and Thomas Haider (Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization).
Funding
This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) [grant number 497113588, to B.G.].
Data availability
Data and code supporting this article are available in a dedicated publicly accessible GitHub repository, archived as a Zenodo record: Haider and Gittel (2025): https://doi.org/10.5281/zenodo.14917297, https://github.com/tnhaider/narrator_semantic_change. We release it under a CC BY 4.0 license. The corpus data were derived from three resources: (1) the Grenzboten Corpus (Nölte et al., 2016) is freely available, e.g. through the German Text Archive (https://www.deutschestextarchiv.de/grenzboten/); (2) the Zeit Corpus (Barbaresi 2021; Geyken et al. 2017) is not freely available in full text, but it can be searched through the DWDS platform (https://www.dwds.de/d/korpora/zeit); (3) the DVjs Corpus (Deutsche Vierteljahrsschrift für Literaturwissenschaft und Geistesgeschichte) is available online (https://www.digizeitschriften.de/), but only through a subscription plan.
Notes
Here and in the following, we use the term ‘fictive’ when we refer to the ontological status (non-actual) and the term ‘fictional’ when we refer to the communication or representational level.
We use the terms ‘meaning’ and ‘word sense’ in line with the linguistic convention according to which one word can have different word senses depending on the context of use. Only by figuring out which word sense is appropriate in a given context (e.g. German ‘Bank’ as a financial institution or a bench) can one assign a meaning to an instance of the word. For example, only by figuring out that ‘narrator’ has the word sense ‘homodiegetic narrator’ in a critic’s utterance ‘The narrator of Moby Dick is fascinated by the whale’ can one assign the meaning ‘Ishmael’ to the term ‘narrator’.
One might object that one can follow conventions without having the concepts that play a role in the linguistic formulation of the conventions. This is certainly true; one can, for example, follow the convention of shaking hands as a greeting without having the concept of ‘shaking hands’. However, the reading convention outlined above is not analogous, since it involves propositional attitudes about certain entities. A better analogy would then be the convention of praying to God every evening—a convention that one can hardly follow (in contrast to pretending to follow) without having a concept of God.
The annotation guidelines are available in the repository supporting the present article, see Data availability section.
For the annotation of A and B, the source of the story is irrelevant, as is the status of the story told, as fictional or non-fictional.
For example, ‘In the Höhgauer Erzähler, the district court judge Veck in Offenburg […] published a review of the Old Catholic movement in Baden and proved that […]’ (Grenzboten 1897, our translation).
The calculation of Cohen’s Kappa essentially corrects the measured ‘accuracy’ of two annotators by a hypothetical random chance agreement, which, overall, for four categories can be estimated at 25 per cent. However, our classes are not evenly distributed, and thus chance agreement for a given class is non-trivial to estimate. To calculate agreement for particular classes (one or two annotators used a certain label for a text snippet), we therefore provide the uncorrected accuracy agreement (without correction for random chance).
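The difference between raw agreement and the chance-corrected Kappa can be illustrated with a minimal stdlib implementation (ours, for illustration; not the authors' code):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected by the chance agreement
    expected from the two annotators' (possibly skewed) label distributions."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability that both annotators independently pick
    # the same label, given their empirical label frequencies.
    p_chance = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)
```

With skewed label distributions, the empirical chance agreement can exceed the uniform 25 per cent estimate, which is why Kappa and the uncorrected accuracy agreement can diverge noticeably.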
The German original reads: ‘Zumeist beherrscht sie eine unheimliche Gemütlichkeit. Und eine unendliche Verlassenheit. Mit dem Nachbarjungen Franzi hat—in der ersten Geschichte „Ankunft in Afrika“ - die Erzählerin, damals zwölf Jahre alt, daheim in Linz verstohlene Spiele getrieben, denen irgendwann eine jähe Entfremdung folgt—ein anderer, Franky heißt er wohl nicht zufällig […]’ (Zeit 1990).
A remark concerning the relative overrepresentation of A in the 1870s: As evident from Fig. 1, the gold-standard annotation for this decade comprises only approximately twenty instances. Therefore, this peculiarity of A can be attributed to the small sample size.
We would like to thank our anonymous reviewers who pointed this out to us.