Abstract

This article examines the epistemic consequences of unfair technologies used in digital humanities (DH). We connect bias analysis informed by the field of algorithmic fairness with perspectives on knowledge production in DH. We examine the fairness of Danish Named Entity Recognition tools through an innovative experimental method involving data augmentation and evaluate the performance disparities based on two metrics of algorithmic fairness: calibration within groups and balance for the positive class. Our results show that only two of the ten tested models comply with the fairness criteria. From an intersectional perspective, we shed light on how unequal performance across groups can lead to the exclusion and marginalization of certain social groups, leading to voices and experiences being disregarded and silenced. We propose incorporating algorithmic fairness in the selection of tools in DH to help alleviate the risk of perpetuating silence and move towards fairer and more inclusive research.

1. Introduction

The integration of machine learning (ML) and advanced language technologies into the field of digital humanities (DH) calls for reflections on the potential machine-inflicted biases and the corresponding epistemic and social consequences. Despite ongoing research and discussions on the issue of fairness, such considerations are not yet an integral part of DH. In this article, we aim to connect bias analysis informed by the field of algorithmic fairness with perspectives from DH.

Discussions around diversity, bias, and representations are important for all fields that develop or depend on contemporary natural language processing (NLP). Empirical studies have repeatedly identified systematic bias in various aspects of NLP, from word embeddings (Kurita et al. 2019; Manzini et al. 2019) to coreference resolution (Zhao et al. 2018) and language generation (Sheng et al. 2021). Performance disparities and lack of representation in a base or foundation model can affect various downstream tasks, and when put to use in DH scholarly pipelines, such disparities can ultimately impact cultural knowledge production. In what follows, we use the construct silencing to explore these issues. Silencing is a way of conceptualizing computational tools as having the power to decide who and what is included and excluded (Carter 2006).

We specifically examine named entity recognition (NER) models’ ability to assign representation by correctly recognizing named entities. This is done by examining representational biases measured as systematic error disparities in NER frameworks. The experiment focuses on the NER task in Danish and examines error disparities in relation to sensitive features analysed through the prism of intersectionality. To analyse these disparities, we use fairness measures from the field of algorithmic fairness and consider three case studies where these findings are directly relevant to knowledge production in DH. By contributing to ongoing work of measuring and quantifying the impacts of biases in NLP and grounding these findings in applied contexts, we aim to provoke reflections on the epistemic consequences of unfair models and their use in DH research. While our results are immediately relevant to DH scholars as they consider integrating tools into their research pipelines, the implications of this study go beyond the realm of Danish NER tools. Applying the suggested experimental approach to different societal contexts can provide a better understanding of bias in NER tools and the consequent epistemic implications.

The article is structured as follows. In Section 1.1, we provide a theoretical framework for bias in computer systems and explore its connection to the concepts of silencing and intersectional feminism. Section 1.2 reviews existing literature on bias in NLP tools, focusing on earlier studies related to our work. We then situate our work within the field of algorithmic fairness in Section 1.3 and introduce the fairness measures used. In Section 2, we present our experimental pipeline, and Section 3 presents the findings of our experiment and relates these to the field of DH through three cases in Section 4. We explore the implications of these findings in Section 5, and finally, in Section 6, we conclude the article by summarizing our main contributions and outlining future considerations for fair and inclusive DH research. Overall, our goal is to shed light on the epistemic consequences of algorithmic unfairness in technologies used in DH research and to explore ways in which the field can work towards more fair and inclusive practices.

1.1 Bias and silence

With their influential typology of biases, Friedman and Nissenbaum (1996) argued that ‘freedom from bias’ should qualify as a criterion for assessing system quality, alongside reliability, accuracy, and efficiency. In their work, biased computer systems are defined as those which ‘systematically and unfairly discriminate against certain individuals or groups of individuals in favour of others.’ Bias in computer systems can be divided into three types: preexisting biases with roots in institutions, practices, and attitudes; technical biases arising from the resolution of issues in the technical specifications; and emergent biases which occur in a use-context after the implementation of a given system.

In their survey of bias research in NLP specifically, Blodgett et al. (2020) present a conceptual framework to characterize and compare biases based on a recognition of the relationships between language and social hierarchies. In their framework, they distinguish between two types of bias, allocation bias and representational bias, building on earlier work by Crawford (2017). The former refers to disparities in the distribution of resources and opportunities among different groups, which can result in unequal access or treatment. The latter, representational bias, pertains to differences in how social groups are represented in the NLP system’s outputs. This can include biases in the form of stereotyping, generalization, or misrepresentation of certain groups. Whereas allocation bias might lead to more obviously harmful outcomes, representational biases are argued to be inherently harmful, as they can sustain and reinforce existing discrimination, echoing the concepts of both preexisting and emergent bias in Friedman and Nissenbaum’s definition. Moreover, the concept of representational bias includes differences in system performance, which can be measured by error rates on specific tasks. It is based on this understanding that we aim to investigate bias in Danish NER models, specifically NER models’ ability to correctly recognize named entities and assign them the appropriate representation.

1.1.1 Silence and power of computational tools

Bias in NLP systems is of increasing concern, especially in scenarios where these systems are applied outside of research settings in a range of social contexts, affecting society at large. There are concomitant epistemological and scientific issues when NLP systems disregard certain social groups. In our work, this disregard, arising from unequal performance across social groups, is conceptualized through the lens of silencing. This meaning of silencing as an epistemic injustice is developed extensively in Fricker (2007). For our purposes, silencing can be understood as having one’s voice, opinions, or perspectives suppressed or muted, either through direct external forces or indirect social dynamics. In the context of NLP, exclusion from datasets and poor representations can result in the silencing of marginalized people.

Carter (2006) explores how archival silence arises from unequal access, use, and display of records in shaping cultural tradition and social memory. Silence in archives denies marginalized groups their voice and opportunity to participate in archiving, resulting in their eventual disappearance from history and incomplete or biased representations of the past. Hence, archives are spaces of power where decisions are made about which voices to include and which to exclude. Similarly, the silencing of certain voices can happen in a research context when archives used for analysis do not represent all groups equally.

Following this argument, datasets are similarly loci of power: decisions about what is included in and excluded from datasets and data collections can introduce silences by denying or ignoring marginalized groups their voices and representation, whether consciously or unconsciously. As suggested above, representational biases are inherently harmful as they reinforce existing discrimination, meaning that the use of a skewed data foundation in research leads to a poor representation of all voices. It is for this reason that we emphasize the importance of understanding representational bias for DH research which builds on contemporary language technology. To illustrate this point, this article uses an innovative experimental design to exemplify representational biases in existing frameworks for Danish NER.

As a task, NER builds on the logic of categorization. An entity in a text is either recognized or unrecognized as an instance of a category (e.g. ‘PERSON’, ‘LOCATION’). Whether technologically or analogically enforced, categorization comes with the risk of producing residual categories: whatever is left out when the categories are established (Star and Bowker 2007). In cases where objects are too complicated to classify within the often taken-for-granted categories, or are unknown to the system, they may be categorized as ‘other’, falling into the cracks of the categorization schema and being relegated into the residual spaces. This can result in the exclusion and marginalization of certain data or information that does not conform to pre-defined categories or norms. Accordingly, this can lead to the experiences of social groups being disregarded or overlooked simply because they are not categorized appropriately. Hence silencing, understood as denying social groups their voices and representation, follows from relegation into these residual spaces, as the groups affected risk being disregarded in further analysis.

When DH research applies NLP systems, the research is conducted on the basis of technological tools and computational methods, and the power to decide who and what is included shifts away from archivists and researchers to the technological configurations. While Carter focuses on archivists’ role in the formation of shared memory and cultural understanding, our research treats the use and representation of data sets in the humanities as holding the same potential for impacting and contributing to collective knowledge. Where the archivist possesses the power to decide who is included and excluded in archives, using NLP to organize, structure, and present data in research contexts gives technology a powerful role in determining who and what is included.

To reiterate the above-stated point in the context of NER, a binary classification (i.e. instance or not an instance of the ‘person’ category) runs the risk of silencing people whose names are unrecognized by automated systems by relegating them to residual spaces in data sets. The harm of such performance bias lies in the consequences of being excluded from the functionalities of these automated systems, which can result in social groups’ perspectives and voices being overlooked in data sets, excluded in a research context, and thereby silenced. We will return to this argument in greater detail in Section 4, where we present three cases from DH research where technologically enforced silencing should be considered.

While it is important to be aware of the social dynamics at play in determining which social groups risk being silenced, it is also crucial to recognize the ways in which intersecting identities can exacerbate biases and discrimination. In the following section, we outline how the framework of intersectional feminism becomes particularly relevant for understanding bias in NLP.

1.1.2 Intersectionality

The framework of intersectional feminism adds an important layer to understanding biases in NLP. Intersectionality refers to biases and discrimination that intersect and potentially amplify across multiple social categories like race, gender, age, sexual orientation, and other identity markers (Crenshaw 2013). Discrimination cannot be solely analysed based on a single factor. For example, women from an ethnic minority might experience other types of discrimination than women from the ethnic majority, and still others than those experienced by men from the ethnic minority. Through the lens of intersectional feminism, we recognize the complex and interconnected nature of social identities and how they interact to shape multidimensional experiences of discrimination and bias. Fig. 1 shows the subgroups that emerge when the dimensions of gender and ethnicity intersect.

However, much of contemporary work on bias in NLP and ML only accounts for a single dimension of oppression at a time—often either gender (Basta, Costa-jussà, and Casas 2019; Kurita et al. 2019) or race (Manzini et al. 2019; Field et al. 2021). When multiple bias markers are examined, their interaction is often ignored (Garg et al. 2018; Czarnowska, Vyas, and Shah 2021; Nadeem, Bethke, and Reddy 2021). Collectively, these studies neglect the intersections of multiple dimensions of discrimination and ignore how these systems affect subgroups in society.

Few studies have covered intersectional biases in NLP by including multiple demographic dimensions in the evaluation of NLP frameworks and tasks. Herbelot, von Redecker, and Müller (2012) conduct a quantitative analysis of concepts from gender studies and propose a methodological approach to exploring intersectional bias in word representations, showing that distributional data can be a useful representation of social phenomena in an analysis of the discursive trends around gender. In a more recent line of work, Lalor et al. (2022) investigate allocation biases by conducting benchmarking experiments on multiple NLP models, assessing fairness and predictive performance across diverse NLP tasks. Furthermore, Subramanian et al. (2021) evaluate and compare different debiasing techniques within the context of intersectional biases. They provide a comprehensive analysis of the effectiveness of various debiasing techniques and propose a novel post hoc debiasing method that is particularly effective for addressing intersectional biases in NLP frameworks.

1.2 Bias in NLP

Numerous studies to date emphasize the presence of unintentional biases in NLP systems, resulting in consistent performance disparities across different demographic groups (Zhao et al. 2018; Borkan et al. 2019; Gaut et al. 2020). NLP as a discipline has responded to this challenge, and several different approaches and metrics have been developed to measure and alleviate potential biases (Borkan et al. 2019; Shah, Schwartz, and Hovy 2019; Blodgett et al. 2020; Gaut et al. 2020; Czarnowska, Vyas, and Shah 2021). Furthermore, an increasing body of research has suggested that Counterfactual Data Augmentation has significant potential when it comes to mitigating biases within NLP frameworks, with applications specifically in coreference resolution (Zhao et al. 2018) and demonstrated effectiveness in a wider range of NLP tasks (Lu et al. 2020). The present article takes as its starting point earlier work in this area (Lassen et al. 2023), which proposes another use of data augmentation, namely as a method to test the robustness of NER models and uncover potential social biases in the models. This approach has already revealed that Danish NLP models are biased in terms of differences in error rates measured as average F1 scores across social groups.

However, while the F1 score for tasks like NER provides a single value summarizing the trade-off between precision and recall, it does not provide detailed information on how errors are distributed. Hence, using F1 scores for each group is a relatively coarse approach to assessing the presence of biases. In order to better understand how errors are distributed, this work opens up the performance metrics to provide a more fine-grained fairness analysis informed by the field of algorithmic fairness. This gives more nuanced insights into the performance of individual Danish NER models, their potential biases, and the consequences for those working in DH more generally.

1.2.1 Name lists as proxies for social groups

Given the sociological evidence and the recognition of representational biases as inherently harmful, it is important to consider the impact of language technologies on those parts of the population which are particularly vulnerable to discrimination (Ranchordás and Scarcella 2021; Jørgensen 2023). This is especially true when aiming for fair and inclusive research. To test for potential error disparities for NER across demographic groups, we divide our dataset into subgroups serving as proxies for the demographic subgroups in question. Each NER model is tested on versions of the dataset which have been augmented with names drawn from name lists covering different demographic subgroups, and error disparities are calculated based on the performance across multiple runs. A detailed discussion of this method and the sources of the name lists can be found in Section 2 below, but our primary argument is that NER models are unfair if they display an error disparity on different groups—such as performing worse on names typically used for women, or names typically used by minority groups.

Deciding which subgroups to include requires contextualization and sensitivity to the social sphere in which a fairness analysis is conducted. This includes a focus on the preexisting biases (Friedman and Nissenbaum 1996) that are a fundamental part of bias in computer systems, as outlined above in Section 1.1. For the purposes of this experiment, we limit ourselves to the context of contemporary Denmark and Danish NER technology and to potential biases along the lines of gender and ethnicity.

On the gender dimension, Denmark has a high level of formal equality, with anti-discrimination laws ensuring constitutional equality and discrimination protection. However, structural oppression still exists and can be shown in studies on the gender pay gap (Gallen, Lesner, and Vejlin 2019) and in statistics on violence against women (European Union Agency for Fundamental Rights 2014). The work by Dahl and Krog (2018) furthermore showed the effects of intersectional discrimination in the Danish labour market. Moreover, Denmark has strict name laws determining which names a person can use according to their assigned gender (something which has been actively criticized by citizen activist groups).1 As such, Danish naming conventions largely reflect a binary conception of gender, but some names are gender-neutral (‘unisex’) and can be used by individuals of any gender. We are aware of the multiple perspectives and understandings of the notion of gender, and that gender can be understood as both performative and constituted by discursive practices (e.g. Butler 2006). Our experiments include these unisex names in an attempt to move beyond this binary conception of gender. In our analysis, we infer gender only at the group level to evaluate potential biases for different demographic groups when subjected to NLP frameworks. Moreover, we do not link names to pronouns and avoid inferences about individual gender identity.

Along the ethnicity dimension, the largest immigrant community in Denmark consists of people descended from Middle Eastern and Muslim countries (Statistics Denmark 2022). Research has shown how people in this group experience various types of discrimination, ranging from harsh rhetoric in political discourse and ministerial administration (Vinding 2020) to hate crimes (Mannov 2021) and exclusion from the labour market (Dahl and Krog 2018). For the purpose of this study, we choose to limit our analysis of minority names to specifically those which follow traditional Muslim naming conventions.

Working with names divided into different sub-groups comes with both pros and cons. On the one hand, deploying proxies for gender and ethnicity makes it possible to conduct an intersectional analysis, examining how gender and ethnicity contribute to the observed error disparities. On the other hand, augmentation based on naming conventions risks reinforcing a folk conception of gender (Keyes 2018), where gender is understood as binary and static, and ruling out other gender identities (Dev et al. 2021). Danish names are neither inherently nor definitively gendered, and the implementation of laws restricting the choice of name based on the sex assigned at birth emphasizes how ideology is present both in Danish name laws and in the language in general (Blodgett et al. 2020).

Likewise, limiting our analysis to names of typically Muslim origin means that not all minority peoples in Denmark are included in our study, pointing to a known limitation of our work. However, names are often interpreted as markers of group affiliation (Khosravi 2012); and research has shown that in Denmark, people with names associated with Middle Eastern roots are regularly subjected to discrimination on the basis of their names (Dahl and Krog 2018). As such, while acknowledging the limitations of our work, we argue that testing performance for this group is a necessary step for quantifying bias in NLP frameworks. We emphasize that our use of name lists is only a proxy for both gender and ethnicity, as the choice of names can vary in a minority group as well as in the majority group. However, when it comes to biases based on names, people with majority names are not at the same risk of being unfairly treated as people with minority names, regardless of their ethnicity (e.g. job applicants face discrimination if their names indicate they are a minority man (Dahl and Krog 2018)).

Finally, situations where minority individuals may have names associated with the majority and vice versa, are not the primary focus of our work, as we are examining the distribution of error rates across different social groups while only inferring at a group level.

1.3 Fairness measures

In analysing performance disparities between social groups and the resulting unfair treatment of those impacted by such disparities, the concept of group fairness from the ML literature on algorithmic fairness provides valuable insights into operationalizing fair and just treatment. Group fairness is defined as equitable treatment of different groups, ensuring that predictions do not disproportionately favour or disadvantage one or multiple social groups. This contrasts with individual fairness, which aims to ensure that similar individuals are treated similarly regardless of their membership in any particular group. While individual fairness focuses on providing equitable treatment to each individual, group fairness balances the distribution of treatments and resources between various groups.

One method for assessing group fairness is statistical parity between groups, for instance, ensuring an equal student acceptance rate for women and men in college admissions (Dwork et al. 2012). From an individual perspective (i.e. individual fairness), though, it may seem unfair that a model favours a student candidate from a minority group who is less qualified than a member of the majority group. From a model perspective, the model loses accuracy to maintain statistical parity, resulting in a parity-accuracy dilemma. Performance-based metrics, such as equality of opportunity, equalized odds, and positive predictive parity, have been developed to manage this dilemma and ensure fair ML applications (Hardt, Price, and Srebro 2016; Verma and Rubin 2018). In what follows, we consider these metrics in more detail and highlight what they mean in the context of NER.

Defining and measuring what is meant by fairness is somewhat trickier for NER than it is for the student admission problem sketched above. Consider a tool used to identify the names of individuals within a larger corpus (i.e. entities tagged as PERSON or PER) and consider furthermore two broadly defined social groups, for example, men and women, or minority and majority ethnicity. Equality of opportunity in the context of NER would hence signify that when such a tool encounters the name of any given individual, the likelihood of correctly recognizing that name should be equal regardless of the social group with which the name is associated. In other words, names typically associated with men should be equally likely to be recognized as names typically associated with women, and minority names should be recognized as reliably as majority names.

Equality of opportunity can be expressed as the balance of predictions between different groups, where the fraction of predicted positives out of actual positives in group a should be the same as for group b. This means that the probability of a positive prediction C given a positive label Y should be equivalent for both groups. Expressed formally:

$$\Pr_a(C = 1 \mid Y = 1) = \Pr_b(C = 1 \mid Y = 1)$$

Since equality of opportunity relies on the ratio of true positives, this means by definition that a fair model should be equally good at recognizing instances of a given positive class for all groups in question. Calculating the true positive rate (TPR) for each group, this means:

$$\mathrm{TPR}_a = \frac{TP_a}{TP_a + FN_a} = \frac{TP_b}{TP_b + FN_b} = \mathrm{TPR}_b$$

In our particular use case, this entails that a fair NER model is one which has a balanced TPR for the PERSON class across all groups, such as women and men, or minority and majority names. A large imbalance in the TPR indicates an unfair model and shows the direction of the imbalance.
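
As a minimal illustration of this check (ours, not code from any of the evaluated frameworks), the TPR can be computed per group from confusion counts and compared directly; the group labels and counts below are placeholders:

```python
# Minimal sketch of the balance-for-the-positive-class check: compute the
# TPR for each group from per-group confusion counts and compare them.
# Group labels and counts are placeholders, not results from this study.

def tpr(tp: float, fn: float) -> float:
    """True positive rate: share of actual PERSON tokens that are recognized."""
    return tp / (tp + fn)

counts = {
    "majority_men":   {"tp": 170.0, "fn": 10.0},
    "minority_women": {"tp": 110.0, "fn": 70.0},
}

tprs = {group: tpr(c["tp"], c["fn"]) for group, c in counts.items()}
print(tprs)  # a large gap between the groups indicates an unfair model
```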

Equalized odds goes further than equality of opportunity, in that it ensures not only balance for the positive class but also the same proportion of false positives across the prescribed groups. For NER, this means that the treatment of items not categorized as PER should likewise be balanced: the fraction of predicted negatives out of actual negatives in group a should be the same as for group b, expressed formally as:

$$\Pr_a(C = 0 \mid Y = 0) = \Pr_b(C = 0 \mid Y = 0)$$

Mathematically, this means that the principle of equalized odds is more restrictive than equality of opportunity, since it targets both true positives and false positives. In other words, a fair model according to equalized odds should also have a balanced true negative rate (TNR) for all groups in question:

$$\mathrm{TNR}_a = \frac{TN_a}{TN_a + FP_a} = \frac{TN_b}{TN_b + FP_b} = \mathrm{TNR}_b$$

However, it is important to note that, in the context of NER, the true negatives correspond to all tokens which are not PERSON entities and which the model correctly identifies as such. Since most tokens in any given corpus are not PERSON entities, the number of true negatives is larger by several orders of magnitude. For any arbitrarily large corpus:

$$TN \gg TP + FP + FN \quad \Rightarrow \quad \mathrm{TNR} = \frac{TN}{TN + FP} \approx 1 \text{ for every group}$$

As such, we find that calculating TNR in the context of NER does not provide any meaningful information, and so disregard equalized odds as a useful metric in quantifying fairness.

There is one additional performance metric of algorithmic group fairness called predictive rate parity, also referred to as calibration within groups. Here we wish to ensure that the items labelled as PERSON actually belong to the category PERSON, and that non-names are not erroneously categorized as PERSON, for any given group:

$$\Pr_a(Y = 1 \mid C = 1) = \Pr_b(Y = 1 \mid C = 1)$$

We can further define this as requiring that the ratio of true positives out of all predicted positives is the same for each group. In other words, the positive predictive value (PPV) should not depend on group membership:

$$\mathrm{PPV}_a = \frac{TP_a}{TP_a + FP_a} = \frac{TP_b}{TP_b + FP_b} = \mathrm{PPV}_b$$
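
A companion sketch for calibration within groups computes the PPV per group; again, the labels and counts are placeholders rather than results from this study:

```python
# Companion sketch for calibration within groups: the PPV per group, i.e.
# the share of predicted PERSON tokens that actually are PERSON entities.
# Group labels and counts are placeholders.

def ppv(tp: float, fp: float) -> float:
    """Positive predictive value for the PERSON class."""
    return tp / (tp + fp)

counts = {
    "majority_men":   {"tp": 170.0, "fp": 26.0},
    "minority_women": {"tp": 110.0, "fp": 33.0},
}

ppvs = {group: ppv(c["tp"], c["fp"]) for group, c in counts.items()}
print(ppvs)  # predictive rate parity requires these values to be (close to) equal
```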

Kleinberg, Mullainathan, and Raghavan (2017) have identified an impossibility theorem of fairness which is important for our work, since it states that predictive rate parity and equality of opportunity cannot, in general, both hold at the same time. Exceptions are the special cases where: (1) the model in question makes perfect predictions, $C = Y$; or (2) the groups have equal base rates, $\Pr_a(Y = 1) = \Pr_b(Y = 1)$. An equal base rate allows a more meaningful performance comparison across the different social groups, because observed performance differences can then not be attributed solely to differences in the distribution of positive instances between the groups in a named entity extraction task. Hence, if the results in our experiments violate the above criteria, we can conclude that this is due to unfair performance differences across social groups. In what follows, we ensure that the second condition holds through a process of data augmentation (see Section 2.2), which allows us to compare across fairness criteria.

1.4 Research questions

We have so far outlined how unequal performance across groups can lead to the exclusion and marginalization of certain social groups, leading to voices and experiences being disregarded and silenced. Additionally, we have highlighted the importance of recognizing the intersectional nature of identities in exacerbating biases and discrimination.

Building on existing research into algorithmic fairness, we have introduced a number of different performance-based metrics which can be used to quantify potentially imbalanced model performance based on variations in predictions across different groups. We demonstrated that, in the context of NER, equalized odds (i.e. balanced TNR across groups) provides no additional information. Instead, we choose to focus on equality of opportunity (i.e. balanced TPR) and predictive rate parity (i.e. balanced PPV). We ensure that these metrics are compatible through a process of data augmentation, to be outlined in the following section.

Drawing this all together, we aim to measure representational biases in NER frameworks through the lens of intersectionality and use algorithmic fairness metrics to analyse potential disparities in NER performance across different social groups. We choose to work specifically in a narrow linguistic and cultural context of contemporary Danish NLP.

This leads to the following research questions:

  • RQ1 To what extent are Danish NLP models fair under the algorithmic fairness measures?

  • RQ2 How does algorithmic fairness, or the lack thereof, affect knowledge production in DH?

2. Method

2.1 Data

In our experimental setup, we have deployed the name lists retrieved by Lassen et al. (2023), and we refer readers to that article for a more detailed overview of the lists and how they are constructed. As a brief summary, though, it is worth mentioning that the list of minority first names is retrieved from Meldgaard (2005) and contains roughly 1,000 names, and for minority last names, a list of Muslim last names is retrieved from FamilyEducation.2 Majority first- and last-name lists are retrieved from Statistics Denmark,3 a governmental agency which handles demographic data in Denmark, filtered to the 500 most common names for men, women, and last names. Finally, the list of unisex names is retrieved from The Agency of Family Law,4 filtered to the 500 most popular unisex names according to the data from Statistics Denmark.

2.2 Experimental pipeline

The experimental pipeline is set up as follows. For each sentence in the DaNE dataset (Hvingelby et al. 2020), we augment the dataset by replacing each PERSON entity with a name randomly sampled from one of the given lists functioning as proxies for the social groups. To avoid nonsensical sentences, we ensure that within one document, a specific name is always replaced by the same name. See step 1 in Fig. 2.

Figure 1. Intersecting gender and ethnicity results in the subgroups A = majority women, B = minority women, C = majority men, D = minority men (Subramanian et al. 2021).

Figure 2. Experimental pipeline: First, for each sentence in the DaNE dataset, we augment the dataset by replacing each PERSON entity with a name randomly sampled from one of the given lists. To avoid nonsensical sentences, we ensure that within one document, a specific name is always replaced by the same name, exemplified in the figure by ‘Mette’ ⇒ ‘Fatima’ in all cases. Second, the NER performance on PERSON entities for all models is tested on the augmented data.
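
The replacement logic can be sketched as follows; this is an illustration of the principle rather than the Augmenty-based implementation used in the study, and the token-and-tag document format is an assumption made for brevity:

```python
import random

# Simplified sketch of the name-substitution step (not the Augmenty-based
# implementation used in the study): every token tagged as PERSON is replaced
# by a name sampled from the chosen list, and the same original token is
# always mapped to the same sampled name within one document. Multi-token
# names and casing details are ignored here for brevity.
def augment_document(tokens, tags, name_list, seed=None):
    rng = random.Random(seed)
    mapping = {}  # original PERSON token -> sampled replacement
    augmented = []
    for token, tag in zip(tokens, tags):
        if tag.endswith("PER"):  # e.g. 'B-PER' / 'I-PER' in DaNE-style tags
            mapping.setdefault(token, rng.choice(name_list))
            augmented.append(mapping[token])
        else:
            augmented.append(token)
    return augmented

# Placeholder example mirroring Fig. 2: 'Mette' is consistently replaced.
tokens = ["Mette", "mødte", "Mette", "igen", "."]
tags = ["B-PER", "O", "B-PER", "O", "O"]
print(augment_document(tokens, tags, ["Fatima", "Aisha"], seed=1))
```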

Following the data augmentation, the NER performance on PERSON entities for all models is tested on the augmented data and estimated by calculating precision, recall, and confusion matrix counts. As the random choice of name influences the performance, we repeat this process twenty times for each model to estimate mean scores (see Step 2 in Fig. 2). We have included all existing frameworks which can be used to perform NER on Danish language data. For a description of each of the models we refer readers to Lassen et al. (2023).
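
The repeated-evaluation step can be sketched as a small loop; the `augment` and `score` callables stand in for the actual Augmenty- and DaCy-based components and are assumptions for illustration:

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Sketch of the repeated-evaluation loop (an illustration, not the actual
# DaCy-based evaluation code): re-augment the test data with freshly sampled
# names on every run, score the model, and average the confusion counts over
# runs. The `augment` and `score` callables are assumed to be supplied by the
# surrounding pipeline.
def repeated_evaluation(
    augment: Callable[[int], Sequence],             # seed -> augmented dataset
    score: Callable[[Sequence], Dict[str, float]],  # dataset -> {'tp', 'fp', 'fn', ...}
    n_runs: int = 20,
) -> Dict[str, float]:
    runs: List[Dict[str, float]] = [score(augment(seed)) for seed in range(n_runs)]
    return {key: mean(run[key] for run in runs) for key in runs[0]}
```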

Finally, we used a t-test to compare whether the scores obtained on the augmented data varied significantly from the baseline. For the baseline, we used the majority names for both genders. Note that the ‘Majority all’ data is also augmented, since the original data might contain names from any of the lists. As we perform multiple comparisons, we adjust the p-values using a Bonferroni correction.
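
A hedged sketch of this test, using placeholder per-run scores and an assumed number of comparisons, might look as follows:

```python
from scipy.stats import ttest_ind

# Sketch of the significance test: compare per-run scores for one augmented
# group against the 'Majority all' baseline and apply a Bonferroni correction.
# The score lists and the number of comparisons are placeholders.
baseline_scores = [0.94, 0.95, 0.93, 0.94, 0.95]  # e.g. recall per run, majority names
group_scores = [0.70, 0.68, 0.72, 0.69, 0.71]     # e.g. recall per run, minority names

n_comparisons = 6  # one comparison per augmented group in this illustration
stat, p_value = ttest_ind(baseline_scores, group_scores)
p_adjusted = min(p_value * n_comparisons, 1.0)    # Bonferroni correction
print(f"t = {stat:.2f}, adjusted p = {p_adjusted:.4f}")
```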

In contrast to Lassen et al. (2023) which tested NER performance on all named entities, we have instead tested for PERSON entities only. Where the test on all named entities might provide insights into general model robustness, focusing only on PERSON entities provides a more realistic approximation of how individuals might be affected by the system. As we examine how unequal performance across groups can lead to the exclusion and silencing of certain social groups, the focus on PERSON entities is better aligned with the intent of this study.

The name augmentation was performed using Augmenty (Enevoldsen 2022), and the model evaluation was performed using DaCy framework (Enevoldsen, Hansen, and Nielbo 2021). All code is made publicly available and published under the open-source Apache 2.0 license.5

From the average confusion counts, we calculate the fairness metrics described in Section 1.3. To assess whether the calculated scores for PPV and TPR are close enough to qualify as well-calibrated and balanced for the positive class, respectively, we relax strict equality and allow a difference of up to 0.05. A similar approach can be found in Verma and Rubin (2018); however, more sophisticated methods are suggested by Zafar et al. (2017).
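
The resulting decision rule can be sketched as follows; the group labels and values are placeholders in the style of the tables below:

```python
# Sketch of the decision rule described above: a model counts as
# well-calibrated / balanced for the positive class if the largest
# between-group difference in PPV / TPR is at most 0.05.
TOLERANCE = 0.05

def within_tolerance(values_by_group: dict, tolerance: float = TOLERANCE) -> bool:
    values = list(values_by_group.values())
    return max(values) - min(values) <= tolerance

# Placeholder values in the style of Table 1.
ppv_by_group = {"majority_men": 0.87, "minority_women": 0.77}
tpr_by_group = {"majority_men": 0.95, "minority_women": 0.62}

is_fair = within_tolerance(ppv_by_group) and within_tolerance(tpr_by_group)
print(is_fair)  # False: this hypothetical model fails both criteria
```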

3. Results

To test the model performance, we ran the experiment twenty times and retrieved averaged scores for all measures, specifically precision, recall, F1, true positives, true negatives, false positives, and false negatives. We performed a t-test to compare whether the scores obtained on the augmented data varied significantly from the majority names for both genders, which functioned as the baseline. As we perform multiple comparisons, we adjust the p-values using a Bonferroni correction. From the average confusion counts, we analysed each model according to the fairness criteria described in Section 1.3.

When calculating the fairness metrics for all the models included in the experimental pipeline, we see that the models ScandiNER and DaCy large are compliant with the fairness criteria above. However, the remaining models (DaCy medium and small, DaNLP BERT, Flair, all SpaCy models, and Polyglot) are shown to be unfair towards certain groups. In the following section, the fairness results for two models are presented, one fair and one unfair. For the rest of the models and the corresponding fairness analysis, we refer readers to the Supplementary Appendix.

3.1 An unfair model: SpaCy large

For SpaCy large, we have obtained the results shown in Table 1. From the counts in the confusion matrix (true positives, false negatives, and false positives), we can calculate the positive predictive value (PPV) and the TPR for each group.

Table 1.

Average confusion counts for SpaCy large on NER on PERSON entities.

Model | Metric | Majority all | Minority all | Majority men | Minority men | Majority women | Minority women | Unisex
SpaCy large (3.4.0) | TP | 168.9 (2.6) | 124.7 (6.3)* | 170.7 (2.5) | 130.8 (4.1) | 167.4 (3.3) | 111.0 (5.6)* | 152.6 (4.4)*
 | FN | 11.1 (2.6) | 55.3 (6.3) | 9.3 (2.5) | 49.2 (4.1)* | 12.6 (3.3) | 69.0 (5.6)* | 27.5 (4.4)*
 | FP | 27.0 (2.2) | 33.4 (3.2)* | 26.5 (1.7) | 33.7 (3.7)* | 27.8 (2.4) | 33.3 (3.1)* | 29.0 (1.4)*
 | PPV | 0.86 | 0.79 | 0.87 | 0.80 | 0.86 | 0.77 | 0.84
 | TPR | 0.94 | 0.69 | 0.95 | 0.73 | 0.93 | 0.62 | 0.85

The column ‘Majority all’ is considered the baseline for the tests on minority, women’s, men’s, and unisex names. * denotes that the result is significantly different from the baseline using a significance threshold of 0.05 with Bonferroni correction for multiple comparisons. Values in parentheses denote the standard deviation. PPV and TPR are calculated on the basis of the average confusion counts.


For the PPV, we see that for minority women’s names, the calculated value is 0.77, and for majority men’s names, it is 0.87. This means that the fraction of predicted positive instances that are actually true PERSON instances is higher for majority men than for minority women. For the TPR, we see a similar pattern: for minority women’s names, the calculated value is 0.62, and for majority men’s names, it is 0.95. This means that the model is better at recognizing instances of majority men’s names compared to minority women’s names. These results show that SpaCy large is neither well-calibrated for each group nor balanced for the positive class, and we, therefore, conclude that SpaCy large is not fair.
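
These values follow directly from the averaged confusion counts in Table 1, as the short recalculation below shows (using the minority women and majority men columns):

```python
# PPV and TPR for SpaCy large recomputed from the averaged counts in Table 1.
tp_min_women, fn_min_women, fp_min_women = 111.0, 69.0, 33.3
tp_maj_men, fn_maj_men, fp_maj_men = 170.7, 9.3, 26.5

print(round(tp_min_women / (tp_min_women + fp_min_women), 2))  # PPV ~ 0.77
print(round(tp_min_women / (tp_min_women + fn_min_women), 2))  # TPR ~ 0.62
print(round(tp_maj_men / (tp_maj_men + fp_maj_men), 2))        # PPV ~ 0.87
print(round(tp_maj_men / (tp_maj_men + fn_maj_men), 2))        # TPR ~ 0.95
```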

We furthermore note that SpaCy large has balance for the positive class between majority men’s and majority women’s names. This shows the importance of carefully considering the social groups included in a fairness analysis, as one might be inclined to conclude that there are no gender biases at play if minority names are left out of the analysis.

Calculating PPV and TPR for DaCy medium, DaCy small, DaNLP BERT, and Flair, we see that these models are well-calibrated across groups, but they are not balanced for the positive class. In-depth arguments for why calibration is not enough have been made elsewhere (Hedden 2021; Larson, Mattu, Kirchner, and Angwin 2016), but for now, we rely on the criteria outlined in Section 1.3 and conclude that the models are not fair. Furthermore, SpaCy medium, SpaCy small, and Polyglot are neither well-calibrated nor balanced for the positive class, and we conclude that these models are unfair.

3.2 A fair model: DaCy large

Similarly, for DaCy large, we have obtained the results shown in Table 2.

Table 2.

Average confusion counts for DaCy large on NER on PERSON entities.

Model | Metric | Majority all | Minority all | Majority men | Minority men | Majority women | Minority women | Unisex
DaCy large (0.1.0) | TP | 176.9 (1.4) | 174.4 (1.6)* | 176.4 (1.5) | 175.1 (2.4) | 176.7 (1.8) | 174.5 (1.6)* | 174.3 (1.6)*
 | FN | 3.2 (1.4) | 5.7 (1.6)* | 3.7 (1.5) | 4.9 (2.4) | 3.4 (1.8) | 5.5 (1.6)* | 5.8 (1.6)*
 | FP | 17.6 (1.4) | 17.9 (1.6) | 18.5 (1.8) | 17.7 (1.4) | 17.6 (0.9) | 17.1 (0.8)* | 18.1 (1.4)
 | PPV | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 | 0.91
 | TPR | 0.98 | 0.97 | 0.98 | 0.97 | 0.98 | 0.97 | 0.97

The column ‘Majority All’ names is considered the baseline for the tests on minority, women’s, men’s and unisex names. * denotes that the result is significantly different from the baseline using a significance threshold of 0.05 with Bonferroni correction for multiple comparisons. Values in parentheses denote the standard deviation. PPV and TPR are calculated on the basis of the average confusion counts.


For PPV, we see that the calculated value is 0.91 for all groups, indicating that the fraction of predicted positive instances that are actually true PERSON instances is equal across all the groups. For the TPR, we have a calculated value of 0.98 for majority names and 0.97 for minority and unisex names. While the values are not precisely equal, we consider this difference minor and conclude that DaCy large is both well-calibrated within groups and balanced for the positive class. This is true across all groups. A similar analysis for ScandiNER shows the same tendencies, and we conclude that this model is fair.

4. Case studies

These results have clear implications in social contexts specifically related to group fairness. However, we also contend that this has significant implications for researchers working in digital and computational humanities. As demonstrated in Section 3 above, the choice of model for Danish NER can result in meaningful sub-group error disparities, and these disparities lead to inaccurate representations of the underlying data. In this section, we provide three illustrative case studies in which potential unfair treatment of groups impacted by performance disparities can lead to problems on downstream tasks and ultimately impact cultural knowledge production.

4.1 Case Study 1—historical representation

Digitization of cultural heritage data is ongoing across the world, not least in Denmark (Bjerring-Hansen et al. 2022). This newly digitized data opens up the possibility of employing NLP techniques such as NER for the purposes of cultural analytics and digital history. However, due to the noise and variance typically found in digital historical texts (Ehrmann et al. 2016; Schweter et al. 2022), the application of NER to historical data is a non-trivial process. Researchers have generally been attuned to these problems, and there have been significant advances in recent years to address them computationally (Schweter and Baiter 2019; Boros et al. 2020).

Despite the impressive improvements made in historical NLP in recent years, there has been significantly less discussion of issues such as bias and fairness. This issue is highlighted by Manjavacas and Fonteyn (2022), who draw attention to the possibility of anachronistic biases entering into analyses through the use of contemporary pretrained language models to study historical language data. The result of their work is an exciting and performant historical BERT model for English, a significant development in computational humanities research. However, as is the standard in the field, competing models are evaluated by comparing F1 scores on specific tasks such as NER but do not feature any quantified metrics regarding bias or fairness.

In both contemporary and historical NLP, there is a continued privileging of bottom-line accuracy as the only measure of performance. However, as we have shown in our study, model performance is not evenly distributed across groups—a problem which is doubly challenging when working with historical data. As a more concrete illustration, consider a DH project which seeks to model representation of minorities over the last one hundred years in Danish newspapers. Such a project would not only have to contend with OCR errors and noisy data typical of this kind of work but also have to take into account the fact that different Danish NER frameworks show significant performance degradation when handling minority names. Given the outstanding problems of applying contemporary NLP solutions to historical data, it seems likely that this performance degradation would only be amplified.

Addressing these questions is a non-trivial task in itself, requiring a sophisticated understanding of historical contexts and shifting modes of representation in historical data. Nevertheless, we suggest that closer consideration of model fairness should be an integral part of research involving historical NER and historical NLP more generally.

4.2 Case Study 2—relation extraction

NER is often used as a sub-component of a larger analysis pipeline. One particularly interesting example of this is the narrative analysis framework developed in Tangherlini et al. (2020) and Shahsavari et al. (2020). This approach is designed specifically to study conspiracy theory narratives and how they develop and spread over time, between individuals and across media. The complete framework is a complex and sophisticated solution to this problem, but one of the core sub-components consists of entity relation extraction performed using NER (in this case with Flair).

This is valuable research which draws on state-of-the-art methods in DH to address a pressing social concern. Nevertheless, as we have demonstrated here, NER models can perform inconsistently across demographic sub-groups—something which may also be a feature of the framework cited above. Moreover, if such a performance disparity does exist, it could feasibly cause substantial downstream problems, with subsequent sub-components of the system building on inadequate representations of the data. It is, therefore, critically important for researchers to be able to account for potentially unfair treatment of different demographics, given the clear social implications of this work in understanding and combating conspiracy theories on social media.

It is important to note that our present results do not directly address these concerns, insofar as there is no guarantee that English NER models exhibit the same imbalances that we see in Danish models. Indeed, since English is a comparatively more well-resourced language than Danish in terms of pre-trained models and available data, it may well be that English NER systems circumvent some of the issues faced by those of us working in Danish language processing. That being said, there are two essential points that can be drawn from our results and this case study. First, our results demonstrate that, in some cases, models perform with wildly different levels of calibration and balance when predicting the positive class. We suggest that it is worth conducting similar experiments on English NER models, while making required changes based on cultural contexts such as name distributions. Second, researchers who would transfer existing frameworks to different languages—from Danish to English, for example—must be sensitive to the issues outlined in the results we describe here.

4.3 Case Study 3—network analysis

In Case Study 2, NER was a sub-component of a much larger computational framework which integrated the output from the NER system alongside other complex sources of information. In such a scenario, it can be challenging to determine the knock-on effect of performance disparities on specific downstream tasks. However, the problem is just as pronounced if we consider a scenario where the relationship between specific named entities extracted from data is the main object of analysis, as is the case with social network analysis (SNA).

By now, SNA has been employed in a wide range of DH contexts: from comparatively small-scale studies of prodigal son characters in Early Modern English drama (Ladegaard and Kristensen-McLachlan 2023) and protestant letter-writing networks (Ahnert and Ahnert 2015), up to larger corpus-based approaches to quantitative character networks across hundreds of years of English literary history (Algee-Hewitt 2018). In many cases, the nodes which comprise the networks are taken directly from the data, such as speaking turns explicitly indicated in TEI-encoded XML documents. In other cases, though, these nodes need to be extracted from texts by other means. For example, in a study of social networks in War and Peace, Fischer and Skorinkin (2021) first use NER to find relevant nodes in the text.

This final approach also forms the basis of DH research in a specifically Danish context, such as Agersnap et al. (2022), a network-based analysis of the cultural figures who appear in a corpus of contemporary sermons from the Danish national church (Agersnap et al. 2020). At the core of this study were all entities marked as PER extracted from the sermon corpus, with edges created in the case of document co-occurrence between entities. At the time of conducting the study, the only usable NER system for Danish was Polyglot. However, as demonstrated above, this particular framework violates our fairness criteria and is neither well-calibrated nor balanced for minority names. This suggests that the previous study may actually under-represent the number of minority individuals mentioned in the corpus. If this were shown to be the case, it would potentially undermine the results in Agersnap et al. (2022), inadvertently reproducing and reinforcing cultural biases against minority names by overlooking their presence in contemporary Danish sermons.

5. Discussion

From the case studies outlined in Section 4, it becomes apparent that using models not compliant with the outlined fairness criteria can have a direct impact on knowledge production in the DH. Whether NER is used as a sub-component of a larger analysis or if the extracted entities constitute the main object of analysis, the results might be considerably skewed if the applied tool does not perform equally well for all groups present in a research area.

Therefore, in order to aim for fair and inclusive research, we argue that DH needs to take fairness concerns seriously when integrating ML and advanced language technologies into its research practices.

5.1 Insights from fairness analysis

Assessing fairness only through a statistically significant difference in F1 scores, as reported in Lassen et al. (2023), suggests that all Danish NLP models are biased with respect to minority names and unisex names, but it does not provide insights into how errors are distributed. This, however, is not the conclusion when deploying the fairness measures described in Section 1.3, as our results show that DaCy large and ScandiNER are fair across all social groups included in our experiments.

The fairness analysis for Danish NER tools conducted in this article allows for more fine-grained insights into the biases and the distribution of errors compared to the performance-focused F1 measure. Evaluating calibration within groups and balance for the positive class shows that all models, even those deemed unfair, are better calibrated than they are balanced. This means that most of the predicted positive instances are actual true instances. On the other hand, imbalance for the positive class indicates that the models in question are not equally good at recognizing positive instances, that is, PERSON entities, for all social groups.

Relating these findings to the concept of silencing outlined in Section 1.1, we see that minority people are at higher risk of not being recognized as a positive instance and hence relegated into the residual space in Danish NLP. The categorization performed by the NER tools can potentially lead to silencing by denying or ignoring minority people their voices and representation. In other words, biases in Danish NER models—in this case expressed through the poorer ability to recognize certain social groups—impact the representation of these social groups. This has potential knock-on effects for knowledge production in DH research, which relies on NER as part of its computational and analytical framework.

In addition, we have demonstrated how analysing fairness through an intersectional lens can reveal whether biases and discrimination are amplified for specific subgroups. Our findings indicate that some of the unfair models exhibit lower calibration and balance for the positive class for minority women’s names compared to minority men’s names. The most noticeable differences are observed in the results from the Polyglot model, where the TPR values for minority women’s and minority men’s names are 0.16 and 0.23, respectively, although both are very low relative to majority names. Similarly, for the SpaCy medium model, the calculated TPR values are 0.52 for minority women’s names and 0.66 for minority men’s names. This demonstrates that dividing the performance measures and fairness groups into smaller subgroups can expose biases that may not be evident when looking at an overall performance or a single dimension of discrimination at a time. As shown in this work, using tools with unequal performance rates can have severe social and epistemic consequences, and we, therefore, invite practitioners of Danish NLP to consider these results when choosing tools for their NLP pipelines. We furthermore encourage scholars to lean on the suggested experimental approach to obtain similar analyses for other languages and social contexts.

5.1.1 NLP and impossibility results

From the perspective of the impossibility theorem of fairness, it is noteworthy that even in the edge case where the data set has equal base rates for all groups, due to data augmentation, we still find that several models display biased inferences (see Table 3). It is equally important to notice that the fair models are either monolingual or, in the case of multi-lingual models, only trained on languages that are similar and closely related (see Table 4). This last observation provides support for the claim that, at least for some tasks, monolingual models outperform multi-lingual models, and that investment in training models also for smaller, ‘low-resource’ languages pays off in terms of bias reduction.

Table 3. Unfair models: Average confusion counts for the unfair models on NER on PERSON entities.

Model | Metric | Majority all | Minority all | Majority men | Minority men | Majority women | Minority women | Unisex
DaCy medium (0.1.0) | TP | 170.1 (3.1) | 153.4 (4.8)* | 171.6 (2.7) | 153.7 (2.9)* | 171.2 (2.4) | 157.2 (3.9)* | 162.2 (3.2)*
 | FN | 9.9 (3.1) | 26.6 (4.8)* | 8.5 (2.7) | 26.4 (2.9)* | 8.8 (2.4) | 22.9 (3.9)* | 17.8 (3.2)*
 | FP | 23.1 (2.3) | 26.0 (3.2)* | 23.3 (2.6) | 26.0 (3.2)* | 23.1 (2.0) | 23.7 (2.5) | 24.5 (2.0)
 | PPV | 0.88 | 0.86 | 0.88 | 0.86 | 0.88 | 0.87 | 0.87
 | TPR | 0.95 | 0.85 | 0.95 | 0.85 | 0.95 | 0.87 | 0.90
DaCy small (0.1.0) | TP | 163.8 (3.1) | 151.8 (4.7)* | 164.2 (3.3) | 153.7 (5.5)* | 164.8 (2.8) | 152.6 (4.3)* | 157.3 (4.1)*
 | FN | 16.2 (3.1) | 28.3 (4.7)* | 15.9 (3.3) | 26.4 (5.5)* | 15.2 (2.8) | 27.5 (4.3)* | 22.7 (4.1)*
 | FP | 28.1 (3.5) | 29.9 (4.0) | 27.7 (3.0) | 31.1 (3.9) | 28.0 (2.7) | 29.6 (3.5) | 27.2 (2.5)
 | PPV | 0.85 | 0.84 | 0.86 | 0.83 | 0.85 | 0.84 | 0.85
 | TPR | 0.91 | 0.84 | 0.91 | 0.85 | 0.92 | 0.85 | 0.87
DaNLP BERT | TP | 174.3 (2.0) | 162.3 (2.7)* | 174.8 (2.1) | 163.3 (3.6)* | 174.4 (1.6) | 162.6 (3.8)* | 166.7 (2.3)*
 | FN | 5.7 (2.0) | 17.8 (2.7)* | 5.3 (2.1) | 16.7 (3.6)* | 5.7 (1.6) | 17.5 (3.8)* | 13.4 (2.3)*
 | FP | 21.5 (2.0) | 25.9 (2.6)* | 21.4 (2.0) | 25.4 (3.3)* | 21.6 (1.5) | 25.6 (2.5)* | 24.2 (2.2)*
 | PPV | 0.89 | 0.86 | 0.89 | 0.87 | 0.89 | 0.86 | 0.87
 | TPR | 0.97 | 0.90 | 0.97 | 0.91 | 0.97 | 0.90 | 0.93
Flair | TP | 172.9 (1.7) | 162.8 (3.0)* | 175.0 (1.4)* | 164.4 (2.5)* | 171.4 (2.9) | 161.9 (3.9)* | 163.6 (3.4)*
 | FN | 7.2 (1.7) | 17.3 (3.0)* | 5.1 (1.4)* | 15.6 (2.5)* | 8.6 (2.9) | 18.1 (3.9)* | 16.5 (3.4)*
 | FP | 19.0 (1.3) | 18.7 (1.8) | 18.5 (1.5) | 19.1 (1.9) | 19.0 (1.7) | 19.2 (1.9) | 20.6 (1.9)*
 | PPV | 0.90 | 0.90 | 0.90 | 0.90 | 0.90 | 0.89 | 0.89
 | TPR | 0.96 | 0.90 | 0.97 | 0.91 | 0.95 | 0.90 | 0.91
SpaCy large (3.4.0) | TP | 168.9 (2.6) | 124.7 (6.3)* | 170.7 (2.5) | 130.8 (4.1)* | 167.4 (3.3) | 111.0 (5.6)* | 152.6 (4.4)*
 | FN | 11.1 (2.6) | 55.3 (6.3)* | 9.3 (2.5) | 49.2 (4.1)* | 12.6 (3.3) | 69.0 (5.6)* | 27.5 (4.4)*
 | FP | 27.0 (2.2) | 33.4 (3.2)* | 26.5 (1.7) | 33.7 (3.7)* | 27.8 (2.4) | 33.3 (3.1)* | 29.0 (1.4)*
 | PPV | 0.86 | 0.79 | 0.87 | 0.80 | 0.86 | 0.77 | 0.84
 | TPR | 0.94 | 0.69 | 0.95 | 0.73 | 0.93 | 0.62 | 0.85
SpaCy medium (3.4.0) | TP | 169.2 (2.9) | 105.0 (7.2)* | 169.9 (2.2) | 118.3 (7.9)* | 167.6 (3.3) | 91.2 (4.6)* | 138.0 (5.0)*
 | FN | 10.9 (2.9) | 75.1 (7.2)* | 10.1 (2.2) | 61.7 (7.9)* | 12.5 (3.3) | 88.8 (4.6)* | 42.1 (5.0)*
 | FP | 29.6 (1.8) | 41.6 (5.5)* | 29.3 (1.8) | 41.1 (4.5)* | 30.6 (3.0) | 44.0 (3.6)* | 34.7 (2.4)*
 | PPV | 0.85 | 0.72 | 0.85 | 0.74 | 0.85 | 0.67 | 0.80
 | TPR | 0.94 | 0.58 | 0.94 | 0.66 | 0.93 | 0.51 | 0.77
SpaCy small (3.4.0) | TP | 131.3 (5.2) | 96.8 (6.6)* | 128.5 (5.8) | 95.9 (5.6)* | 133.2 (5.6) | 93.3 (6.7)* | 117.1 (7.4)*
 | FN | 48.7 (5.2) | 83.3 (6.6)* | 51.5 (5.8) | 84.1 (5.6)* | 46.9 (5.6) | 86.8 (6.7)* | 63.0 (7.4)*
 | FP | 44.3 (2.0) | 46.2 (1.7)* | 44.8 (1.5) | 46.3 (2.0)* | 44.8 (2.1) | 44.9 (2.3) | 44.6 (2.1)
 | PPV | 0.75 | 0.68 | 0.74 | 0.67 | 0.75 | 0.68 | 0.72
 | TPR | 0.73 | 0.54 | 0.71 | 0.53 | 0.74 | 0.52 | 0.65
Polyglot | TP | 150.0 (4.3) | 37.3 (4.4)* | 155.6 (4.8)* | 41.4 (4.2)* | 140.5 (5.9)* | 28.2 (4.7)* | 113.3 (5.4)*
 | FN | 30.1 (4.3) | 142.8 (4.4)* | 24.5 (4.8)* | 138.7 (4.2)* | 39.5 (5.9)* | 151.9 (4.7)* | 66.8 (5.4)*
 | FP | 58.5 (3.5) | 85.7 (6.6)* | 56.2 (4.9) | 86.5 (5.7)* | 63.1 (3.6)* | 84.4 (5.7)* | 74.2 (4.4)*
 | PPV | 0.72 | 0.30 | 0.73 | 0.32 | 0.69 | 0.25 | 0.60
 | TPR | 0.83 | 0.21 | 0.86 | 0.23 | 0.78 | 0.16 | 0.63

The column 'Majority all' is considered the baseline for the tests on minority, women's, men's, and unisex names. * denotes that the result is significantly different from the baseline using a significance threshold of 0.05 with Bonferroni correction for multiple comparisons. Values in parentheses denote the standard deviation. PPV and TPR are calculated on the basis of the average confusion counts. As the differences between the calculated PPV and TPR values are greater than 0.05, we conclude that these models are unfair.
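The PPV and TPR values in the table can be reproduced directly from the average confusion counts; for instance, for Polyglot on minority women's names (a quick check of ours, not part of the original analysis code):

```python
# Recomputing Polyglot's scores for minority women's names from Table 3.
tp, fn, fp = 28.2, 151.9, 84.4

tpr = tp / (tp + fn)   # 28.2 / 180.1 ≈ 0.16
ppv = tp / (tp + fp)   # 28.2 / 112.6 ≈ 0.25

print(round(tpr, 2), round(ppv, 2))
```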


Table 4. Fair models: Average confusion counts for the fair models on NER on PERSON entities.

Model | Metric | Majority all | Minority all | Majority men | Minority men | Majority women | Minority women | Unisex
ScandiNER | TP | 177.1 (1.7) | 172.9 (2.3)* | 176.4 (1.8) | 173.1 (2.5)* | 176.8 (1.4) | 173.3 (1.7)* | 173.8 (2.1)*
 | FN | 2.9 (1.7) | 7.1 (2.3)* | 3.7 (1.8) | 6.9 (2.5)* | 3.3 (1.4) | 6.8 (1.7)* | 6.3 (2.1)*
 | FP | 8.2 (1.5) | 6.8 (1.4)* | 8.2 (1.2) | 7.7 (1.4) | 8.2 (1.5) | 6.8 (1.4)* | 8.1 (1.3)
 | PPV | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96
 | TPR | 0.98 | 0.96 | 0.98 | 0.96 | 0.98 | 0.96 | 0.97
DaCy large (0.1.0) | TP | 176.9 (1.4) | 174.4 (1.6)* | 176.4 (1.5) | 175.1 (2.4) | 176.7 (1.8) | 174.5 (1.6)* | 174.3 (1.6)*
 | FN | 3.2 (1.4) | 5.7 (1.6)* | 3.7 (1.5) | 4.9 (2.4) | 3.4 (1.8) | 5.5 (1.6)* | 5.8 (1.6)*
 | FP | 17.6 (1.4) | 17.9 (1.6) | 18.5 (1.8) | 17.7 (1.4) | 17.6 (0.9) | 17.1 (0.8) | 18.1 (1.4)
 | PPV | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 | 0.91
 | TPR | 0.98 | 0.97 | 0.98 | 0.97 | 0.98 | 0.97 | 0.97

The column 'Majority all' is considered the baseline for the tests on minority, women's, men's, and unisex names. * denotes that the result is significantly different from the baseline using a significance threshold of 0.05 with Bonferroni correction for multiple comparisons. Values in parentheses denote the standard deviation. PPV and TPR are calculated on the basis of the average confusion counts. As the differences in the calculated PPV and TPR values are less than 0.05, we conclude that these models are fair.


5.2 Epistemic consequences

Representational biases, such as those examined in this work, are not only inherently harmful in sustaining and reinforcing existing discrimination; they also affect knowledge production. When research in the humanities is conducted on the basis of technological tools and computational methods, the power to decide who and what is included can shift away from archivists and researchers and onto the technological configurations. Hence, seeing datasets as loci of power calls for critical reflection on the use of NLP frameworks to organize, structure, and present data in research contexts.

From our results, we draw two epistemic consequences. First, the use of unfair tools can lead to research results that misrepresent the studied phenomenon. One example is unequal performance rates of automatic extraction tools in network analysis, which can lead to certain groups being excluded from the final analysis. Second, moving beyond the purely scholarly perspective, such machine-inflicted biases can contribute to an incomplete representation of culture and a biased formation of shared memory and cultural understanding. For example, if our shared understanding of how minorities were represented in Danish newspapers in the past relies on analyses built on NER frameworks that exhibit significant performance degradation on minority names, it can create biased representations of the past. Therefore, we propose that incorporating algorithmic fairness metrics in the selection of tools for DH can help alleviate the risk of perpetuating silence.
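To see how such exclusion propagates downstream, consider a rough, back-of-the-envelope sketch (ours, with invented mention counts) of a character network built from automatically extracted PERSON entities: applying the per-group recall rates of an unfair model from Table 3 to equally sized sets of gold mentions leaves the minority group with far fewer surviving nodes and edges.

```python
# Rough illustration (invented mention counts): how unequal recall shrinks
# a group's presence in a downstream network analysis.
mentions_per_group = {"majority": 1000, "minority": 1000}  # equal ground truth

# TPR values for the SpaCy medium model from Table 3.
recall = {"majority": 0.94, "minority": 0.58}

surviving = {g: round(mentions_per_group[g] * recall[g]) for g in mentions_per_group}
print(surviving)  # {'majority': 940, 'minority': 580}

# An edge requires both endpoints to be recognized, so the loss compounds:
edge_survival = {g: round(recall[g] ** 2, 2) for g in recall}
print(edge_survival)  # {'majority': 0.88, 'minority': 0.34}
# Roughly two-thirds of within-minority ties would be missing from the
# extracted network before any substantive analysis has even begun.
```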

5.3 Limitations

While the results of our study shed light on several important aspects of fairness in NER, it is worth noting some limitations of our approach. One of the main limitations is that we presented results and tested our methodology on a single, relatively small Indo-European language. However, conducting a meaningful fairness analysis of NLP models depends significantly on the specific linguistic and sociocultural context of the language in question, and we do not regard achieving universal fairness in NLP as a feasible objective. We therefore invite researchers to conduct similar analyses for other languages, and we expect that comparable results could be obtained, given an appropriate change of experimental conditions.

Second, our use of proxies for social groups has limitations. On the gender dimension, using gendered name lists corresponding to the Danish name laws relies on, and thereby reinforces, a binary understanding of gender, which may be seen as overly simplistic. Additionally, we only included a proxy for one minority group in Denmark in our experimental pipeline and did not consider other minority communities. To address this limitation, future work could include names from other minority communities in Denmark, such as Vietnamese or Greenlandic names, in the experimental pipeline.

A final limitation of our study is the choice of the threshold between fair and unfair models. In this work, we have required the differences in the calculated fairness metrics to be at most 0.05 for a model to be considered fair. Since there are no agreed-upon thresholds for determining when calibration is equal enough, this threshold may appear arbitrary. Nonetheless, this absence of standard practices emphasizes the need for researchers to reflect critically on how they assess the fairness of the tools they integrate into their scholarly pipelines and how it affects their research results.
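Because the cut-off is a choice rather than a standard, one simple way to surface its influence is to vary it and observe which models change status. The sketch below (ours) uses the largest TPR gaps between majority and minority names implied by Tables 3 and 4 for two models; any other model or metric could be substituted.

```python
# Sketch: how the binary fairness verdict depends on the chosen threshold.
# Largest TPR gaps (majority vs. minority names) taken from Tables 3 and 4.
max_tpr_gap = {"DaCy large": 0.01, "DaCy medium": 0.10}

for threshold in (0.01, 0.05, 0.10):
    verdicts = {model: ("fair" if gap <= threshold else "unfair")
                for model, gap in max_tpr_gap.items()}
    print(f"threshold={threshold:.2f}: {verdicts}")

# DaCy large passes at every threshold tried, while DaCy medium is only
# deemed fair at the most permissive one, which shows how sensitive the
# verdict can be to this choice and why it should be reported explicitly.
```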

6. Conclusions

Our analysis of fairness in Danish NER combined an innovative experimental method built on data augmentation with an analytical perspective rooted in intersectional feminism. This allowed us to test for fairness across subgroups and examine the impact of multiple dimensions of discrimination. In contrast to earlier studies (Lassen et al. 2023), which report average F1 scores, we evaluated performance disparities based on two metrics of algorithmic fairness: calibration within groups and balance for the positive class. This provided detailed information on how errors are distributed and offered nuanced insights into the performance of individual Danish NER models, their potential biases, and the consequences for those working in DH more generally.

Our results show two things. First, only the models ScandiNER and DaCy large comply with the fairness criteria; the remaining models we test (DaCy medium and small, DaNLP BERT, Flair, all SpaCy models, and Polyglot) are shown to be unfair towards certain groups. Data augmentation provides a special case in which the groups considered in the fairness analysis have equal base rates, which allows for a more meaningful performance comparison across the different social groups (Kleinberg, Mullainathan, and Raghavan 2017). Hence, we can conclude that the observed differences reflect unfair performance across social groups rather than differences in base rates. Second, viewing these results through the concept of silencing, we argued that unequal performance across groups can lead to the exclusion and marginalization of certain social groups, as their voices and experiences are disregarded and silenced.

NER builds on the logic of categorization, which can inherently lead to the experiences of social groups being disregarded or overlooked when they are inappropriately categorized. Hence, silencing, understood as the denial of voices and representation to social groups, follows from their relegation to residual spaces, where they risk being disregarded in further analysis. Whether NER is used as a sub-component of a larger analysis or the extracted entities constitute the main object of analysis, unfair tools naturally lead to inaccurate results that misrepresent the studied phenomenon. Moving beyond a narrow, purely scholarly perspective, such machine-inflicted biases can contribute to an incomplete representation of culture and a biased formation of shared memory and cultural understanding. Incorporating algorithmic fairness metrics in the selection of tools for DH can help alleviate the risk of perpetuating the silencing of marginalized voices.

These potential limitations notwithstanding, we have shown, first, that unfair tools can lead to poor representation of social groups and, second, that the use of such tools has the potential to cause the silencing of marginalized groups in DH research.

Author contributions

Ida Marie S. Lassen (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Visualization, and Writing—original draft), Ross Deans Kristensen-McLachlan (Conceptualization, Methodology, Supervision, and Writing—review & editing), Mina Almasi (Data curation and Software), Kenneth Enevoldsen (Conceptualization, Methodology, and Software), Kristoffer Nielbo (Conceptualization, Formal analysis, Resources, Supervision, and Writing—review & editing)

Supplementary data

Supplementary data is available at DSH online.

Funding

None declared

Footnotes

1. See Ligebehandling for Alle (2021) for the citizen proposal for abolishing the gender-separated name lists, including critique and explanations (available only in Danish).

References

Agersnap, A. et al. (2022) 'Unveiling the Character Gallery of Sermons: A Social Network Analysis of 11,955 Danish Sermons', Temenos, 58: 119–46. https://doi.org/10.33356/temenos.100454.
Agersnap, A. et al. (2020) 'Sermons as Data: Introducing a Corpus of 11,955 Danish Sermons', Cultural Analytics, 12: 1–27. https://doi.org/10.22148/001c.18238.
Ahnert, R. and Ahnert, S. E. (2015) 'Protestant Letter Networks in the Reign of Mary I: A Quantitative Approach', ELH, 82: 1–1. https://doi.org/10.1353/elh.2015.0000.
Algee-Hewitt, M. A. (2018) 'Distributed Character: Quantitative Models of the English Stage, 1550–1900', New Literary History, 48: 751–82.
Basta, C., Costa-jussà, M. R., and Casas, N. (2019) 'Evaluating the Underlying Gender Bias in Contextualized Word Embeddings', Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pp. 33–9. Florence, Italy: Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-3805.
Bjerring-Hansen, J. et al. (2022) 'Mending Fractured Texts: A Heuristic Procedure for Correcting OCR Data', Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), CEUR Workshop Proceedings, Vol. 3232, pp. 177–86.
Blodgett, S. L. et al. (2020) 'Language (Technology) is Power: A Critical Survey of "Bias" in NLP', Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5454–76. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.485.
Borkan, D. et al. (2019) 'Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification', Companion Proceedings of The 2019 World Wide Web Conference, pp. 491–500. San Francisco, USA: Association for Computing Machinery. https://doi.org/10.1145/3308560.3317593.
Boros, E. et al. (2020) 'Alleviating Digitization Errors in Named Entity Recognition for Historical Documents', Proceedings of the 24th Conference on Computational Natural Language Learning, pp. 431–41. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.conll-1.35.
Butler, J. (2006) Gender Trouble: Feminism and the Subversion of Identity. Abingdon: Routledge.
Carter, R. G. S. (2006) 'Of Things Said and Unsaid: Power, Archival Silences, and Power in Silence', Archivaria, 61: 215–33.
Crawford, K. (2017) The Trouble with Bias. NIPS 2017 Keynote. [Online; accessed 18 February 2023], published by The Artificial Intelligence Channel, youtube.com.
Crenshaw, K. W. (2013) 'Mapping the Margins: Intersectionality, Identity Politics, and Violence against Women of Color', in The Public Nature of Private Violence, pp. 93–118. Abingdon: Routledge.
Czarnowska, P., Vyas, Y., and Shah, K. (2021) 'Quantifying Social Biases in NLP: A Generalization and Empirical Comparison of Extrinsic Fairness Metrics', Transactions of the Association for Computational Linguistics, 9: 1249–67.
Dahl, M. and Krog, N. (2018) 'Experimental Evidence of Discrimination in the Labour Market: Intersections between Ethnicity, Gender, and Socio-economic Status', European Sociological Review, 34: 402–17.
Dev, S. et al. (2021) 'Harms of Gender Exclusivity and Challenges in Non-binary Representation in Language Technologies', Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1968–94. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.150.
Dwork, C. et al. (2012) 'Fairness through Awareness', Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS '12), pp. 214–26. Cambridge, MA: Association for Computing Machinery. https://doi.org/10.1145/2090236.2090255.
Ehrmann, M. et al. (2016) 'Diachronic Evaluation of NER Systems on Old Newspapers', 13th Conference on Natural Language Processing, pp. 97–107. Bochum, Germany: Bochumer Linguistische Arbeitsberichte.
Enevoldsen, K. (2022) Augmenty: The Cherry on Top of Your NLP Pipeline. Version 1.0.1. https://doi.org/10.5281/zenodo.6675315.
Enevoldsen, K., Hansen, L., and Nielbo, K. (2021) 'DaCy: A Unified Framework for Danish NLP', Proceedings of the Computational Humanities Research Conference 2021. Amsterdam, Netherlands: CEUR-WS.org.
European Union Agency for Fundamental Rights (2014) Violence against Women: An EU-wide Survey. Technical report. Vienna, Austria: European Union Agency for Fundamental Rights.
Field, A. et al. (2021) 'A Survey of Race, Racism, and Anti-racism in NLP', Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1905–25. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.149.
Fischer, F. and Skorinkin, D. (2021) 'Social Network Analysis in Russian Literary Studies', in Jan, F., David, C. M., and Richard, R. N. (eds) The Palgrave Handbook of Digital Russia Studies, Chapter 29, pp. 517–36. London: Palgrave Macmillan.
Fricker, M. (2007) Epistemic Injustice: Power and the Ethics of Knowing. Oxford: Oxford University Press.
Friedman, B. and Nissenbaum, H. (1996) 'Bias in Computer Systems', ACM Transactions on Information Systems (TOIS), 14: 330–47.
Gallen, Y., Lesner, R. V., and Vejlin, R. (2019) 'The Labor Market Gender Gap in Denmark: Sorting Out the Past 30 Years', Labour Economics, 56: 58–67.
Garg, N. et al. (2018) 'Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes', Proceedings of the National Academy of Sciences, 115: E3635–44.
Gaut, A. et al. (2020) 'Towards Understanding Gender Bias in Relation Extraction', Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2943–53. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.265.
Hardt, M., Price, E., and Srebro, N. (2016) 'Equality of Opportunity in Supervised Learning', Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3323–31. Barcelona, Spain: Advances in Neural Information Processing Systems.
Hedden, B. (2021) 'On Statistical Criteria of Algorithmic Fairness', Philosophy & Public Affairs, 49: 209–31. https://doi.org/10.1111/papa.12189.
Herbelot, A., von Redecker, E., and Müller, J. (2012) 'Distributional Techniques for Philosophical Enquiry', Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 45–54. Avignon, France: Association for Computational Linguistics.
Hvingelby, R. et al. (2020) 'DaNE: A Named Entity Resource for Danish', Proceedings of the 12th Language Resources and Evaluation Conference (LREC '20), pp. 4597–604. Marseille, France: European Language Resources Association.
Jørgensen, R. F. (2023) 'Data and Rights in the Digital Welfare State: The Case of Denmark', Information, Communication & Society, 26: 123–38.
Keyes, O. (2018) 'The Misgendering Machines: Trans/HCI Implications of Automatic Gender Recognition', Proceedings of the ACM on Human-Computer Interaction, 2 (CSCW), pp. 1–22. New York, NY, USA: Association for Computing Machinery.
Khosravi, S. (2012) 'White Masks/Muslim Names: Immigrants and Name-changing in Sweden', Race & Class, 53: 65–80.
Kleinberg, J., Mullainathan, S., and Raghavan, M. (2017) 'Inherent Trade-offs in the Fair Determination of Risk Scores', in Papadimitriou, C. H. (ed.) 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), Leibniz International Proceedings in Informatics (LIPIcs), Vol. 67, pp. 43:1–43:23. Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. https://doi.org/10.4230/LIPIcs.ITCS.2017.43.
Kurita, K. et al. (2019) 'Measuring Bias in Contextualized Word Representations', Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pp. 166–72. Florence, Italy: Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-3823.
Ladegaard, J. and Kristensen-McLachlan, R. D. (2023) 'Prodigal Heirs and Their Social Networks in Early Modern English Drama, 1590–1640', Law & Literature, 35: 31–53. https://doi.org/10.1080/1535685X.2021.1902635.
Lalor, J. P. et al. (2022) 'Benchmarking Intersectional Biases in NLP', Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3598–609. Seattle, United States: Association for Computational Linguistics.
Larson, J., Mattu, S., Kirchner, L., and Angwin, J. (2016) 'How We Analyzed the COMPAS Recidivism Algorithm', https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm (visited on 05/15/2023).
Lassen, I. M. S. et al. (2023) 'Detecting Intersectionality in NER Models: A Data-driven Approach', Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. Dubrovnik, Croatia: International Conference on Computational Linguistics.
Ligebehandling for Alle (2021) Ligebehandling for Alle: Afskaf de kønsopdelte Navnelister. [Online; accessed 18 February 2023], published by borgerforslag.dk.
Lu, K. et al. (2020) 'Gender Bias in Neural Natural Language Processing', in Logic, Language, and Security: Essays Dedicated to Andre Scedrov on the Occasion of His 65th Birthday, pp. 189–202. Cham, Switzerland: Springer Nature.
Manjavacas, E. and Fonteyn, L. (2022) 'Adapting vs. Pre-training Language Models for Historical Languages', Journal of Data Mining & Digital Humanities, NLP4DH: 1–19. https://doi.org/10.46298/jdmdh.9152.
Mannov, J. (2021) Fakta Om Hadforbrydelser. [Online; accessed 31 January 2023]. København, Denmark: The Danish Crime Prevention Council.
Manzini, T. et al. (2019) 'Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings', Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 615–21. Minneapolis, MN: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1062.
Meldgaard, E. V. (2005) Muslimske Fornavne i Danmark. København, Denmark: Københavns Universitet.
Nadeem, M., Bethke, A., and Reddy, S. (2021) 'StereoSet: Measuring Stereotypical Bias in Pretrained Language Models', Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5356–71. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.416.
Ranchordás, S. and Scarcella, L. (2021) 'Automated Government for Vulnerable Citizens: Intermediating Rights', William & Mary Bill of Rights Journal, 30: 373.
Schweter, S. and Baiter, J. (2019) 'Towards Robust Named Entity Recognition for Historic German', Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pp. 96–103. Florence, Italy: Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-4312.
Schweter, S. et al. (2022) 'hmBERT: Historical Multilingual Language Models for Named Entity Recognition', Proceedings of the Working Notes of CLEF 2022 – Conference and Labs of the Evaluation Forum, pp. 1109–29. Bologna, Italy: CEUR-WS.org.
Shah, D., Schwartz, H. A., and Hovy, D. (2019) 'Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview', Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5248–64. Online: Association for Computational Linguistics.
Shahsavari, S. et al. (2020) 'Conspiracy in the Time of Corona: Automatic Detection of Emerging COVID-19 Conspiracy Theories in Social Media and the News', Journal of Computational Social Science, 3: 279–317. https://doi.org/10.1007/s42001-020-00086-5.
Sheng, E. et al. (2021) 'Societal Biases in Language Generation: Progress and Challenges', Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4275–93. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.330.
Star, S. L. and Bowker, G. C. (2007) 'Enacting Silence: Residual Categories as a Challenge for Ethics, Information Systems, and Communication', Ethics and Information Technology, 9: 273–80.
Statistics Denmark (2022) Fakta om Indvandrere Og Efterkommere i Danmark. [Online; accessed 31 January 2023]. Denmark: Statistics Denmark.
Subramanian, S. et al. (2021) 'Evaluating Debiasing Techniques for Intersectional Biases', Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2492–8. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.193.
Tangherlini, T. R. et al. (2020) 'An Automated Pipeline for the Discovery of Conspiracy and Conspiracy Theory Narrative Frameworks: Bridgegate, Pizzagate and Storytelling on the Web', PLoS One, 15: e0233879. https://doi.org/10.1371/journal.pone.0233879.
Verma, S. and Rubin, J. (2018) 'Fairness Definitions Explained', Proceedings of the International Workshop on Software Fairness, pp. 1–7. New York, NY, USA: Association for Computing Machinery.
Vinding, N. V. (2020) 'Discrimination of Muslims in Denmark', in State, Religion and Muslims, pp. 144–96. Leiden, Netherlands: Brill.
Zafar, M. B. et al. (2017) 'Fairness Constraints: Mechanisms for Fair Classification', Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 962–70. Fort Lauderdale, Florida, USA: PMLR.
Zhao, J. et al. (2018) 'Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods', Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 15–20. New Orleans, Louisiana: Association for Computational Linguistics.
