Digital humanities, knowledge complexity, and the five ‘aporias’ of digital research

This article introduces a frame of reference for understanding the fundamental challenges that inform digital humanities as an interdisciplinary research area between arts, humanities, information, and computer science. Its conclusions are based upon the evidence base developed within an EU-funded collaboration known as Knowledge Complexity, or KPLEX for short (www.kplex-project.eu), in particular via the project's thirty-eight linked interviews about big data research. When viewed from the perspective of the digital humanities, five distinct points of 'aporia' with a significant impact on digital humanities (DH) appear in this corpus, places where the interviewees explicitly or tacitly expose gulfs between the epistemic cultures that contribute to DH and that create tensions between these disciplines, even as they seek to collaborate. This article will explore these areas of apparent irreconcilability, and conclude with a series of reflections on how digital humanities researchers might build upon their unique competency profile to negotiate within these critical conversations, in particular in the framework of the emerging subfield of critical digital humanities.


Introduction
The idea that Digital Humanities practitioners might provide a translational capacity within and between the arts, humanities, information, and computer science, easing collaboration between these disciplines and enhancing shared results, is not a new one: in fact, there is a long tradition of conceptualizing at least some digital humanists as 'intermediaries' (Edmond, 2005), 'translators' (Siemens et al., 2011) or 'hybrid people' (Liu et al., 2007; Lutz et al., 2008, cited in Siemens et al., 2011). As the long-predicted 'postdigital' mainstreaming of digital humanities and digital methods into arts and humanities research advances, we might expect the continuation of this transformation of the digital humanities from a disruptive force to a supportive one. Furthermore, while some within the academy certainly view the potential industrial relevance of the digital humanities with suspicion (Allington et al., 2016), there are perhaps even more voices from industry itself calling for the development of a more humanistic, critical dimension in the work of the ICT industry (Hartley, 2017; Madsbjerg, 2017; Hern, 2018; Centre for Humane Technology, 2019; The Copenhagen Letter, 2019).
While it may therefore seem timely to explore, as Liu (2012, 2016) has called for, how the digital humanities might deliver a linchpin set of critical competencies for and reflections on the techno-social or techno-cultural interface, how this intervention into technology development might resonate with the originating tenets of digital humanities (DH) as a field in which technological methods are applied to the interrogation of humanities materials remains unclear. This article will introduce such a frame of reference by exploring the implications for digital humanities to be found in a corpus of thirty-eight interviews about big data research, interpretation of which was supplemented with a comprehensive literature review, targeted surveys, and a data-mining exercise. The project that developed this material, an EU-funded collaboration known as Knowledge Complexity, or KPLEX (2019) for short (www.kplex-project.eu), explored in depth the perspectives on and attitudes toward data, and big data in particular, found among the epistemic and professional communities the digital humanities bring together, in spite of the fact that it was not actually conceived of as a digital humanities project per se. Instead, the work was sponsored under the European Commission's 'Big Data Public Private Partnership' as an innovative response to questions of the biases and blindspots that a cultural studies approach (as opposed to more common methodologies based in the practices of e.g. Human-Computer Interaction (HCI) or Science and Technology Studies (STS)) might reveal within current and future data-centric research agendas. In the process of pursuing this goal, the project's results exposed a number of fundamental issues about the ecosystem of applied data-driven research: how actors within this system create data, how they use them (or misuse them), and the nature of the research results based on them.
It also exposed the depth of the misalignment between approaches to how knowledge is generated and validated across collaborating disciplines, and the values that underpin them. In this way, although the DH community was not originally viewed as a primary audience for the project (only two of the eight primary team members would tend to self-identify as digital humanists), nor were DH researchers sought out explicitly to contribute to its interviews, the project's evidence corpus throws significant light onto the environment that surrounds DH, and the potential future interventions its adherents might make into a technologized society and economy. Indeed, what the interview corpus reveals perhaps more than anything else is a yawning cultural divide, seemingly impossible to bridge, between the creators of software tools and their intended users. This divide, which we characterize as a series of aporias, or points at which mutual understanding and convergence seem nearly impossible, is one that lies at the heart of the challenge of the digital humanities, in particular in those aspects of its pursuits where data form the object of study, and therefore places where the digital humanities knowledge base may be particularly applicable, if not indeed urgently required, in contexts very far removed from their traditional base. The evidence the project produced therefore offers much food for thought to those who identify as digital humanists, as it points toward a number of key barriers widely faced across the landscape of data science that are perhaps optimally negotiated within our hybrid research space, where the applied and the theoretical, the 'hack' and the 'yack', complement each other in highly relevant ways.
KPLEX was structured according to three major research questions, each of which was framed according to variations on a central project methodology using tools from the social sciences (primarily surveys and interviews) but interpreted from the perspective of humanities methods. Specifically, the project aimed first to develop an understanding of how computer scientists view and communicate about the data they work with (WP2). Second, the project sought to come to grips with how the dissociation of the data set from its origins in individualized processes of information gathering impacts upon the development of research processes (WP3). The final aim was to look at research approaches and the data they create so as to understand how complex data sets are shaped by the interpretive minds that create them (WP4). The teams behind these three strands each chose a reference population to study (computer scientists, professionals based in collection-holding institutions, and researchers in the very interdisciplinary community that studies human emotion, respectively) and created linked, but distinct, research instruments (in particular interview questionnaires) to uncover attitudes and practices related to these issues.1 When viewed from the perspective of the KPLEX project's resulting data, five distinct points of the aforementioned aporia seemed to arise, places where the interviewees explicitly or tacitly exposed gulfs in epistemic culture demonstrating fundamental and often conflicting stances regarding data and knowledge in the landscape they navigate. The KPLEX interviews clearly illustrate the embeddedness of these challenges in the foundations of the disciplines that contribute to DH, as well as the strategies the community has developed and deployed.
The issues discussed in this article are buried deep within the 'subconscious' and values of research and professional communities regarding their work. They form a tacit layer that may create barriers that go unrecognized for a very long time (even within a close collaboration), if they are ever surfaced and consciously considered at all. The further entanglement with professional identities and values raises them above the level of mere management challenges to a status where a more fundamental reconsideration of the scholarship produced within such collaborations may be required. In these fundamentals we may find future avenues for DH to grow in its own right, but also to expand and reconsider its potential impact beyond humanities research questions and the development of research technology, a topic that will be revisited in the conclusion section of this article.

Language Matters and Communication Can Falter When This Is Taken for Granted
The first unreconciled gap the KPLEX project explored was centred around the question of whether key terms required for the discussion of interdisciplinary data-centric research and development were understood similarly by different contributing communities, and, if not, whether strategies were in place to resolve the differences in usage and mitigate risks that might arise from misunderstandings. This is not in itself a new question to ask, with prominent moments in its evolution running from Star and Griesemer's (1989) conceptualization of 'boundary objects', to Strasser and Edwards' (2017) placing of data in the transformations from nature to knowledge, to Galison's understanding of how 'interlanguages' (such as jargons, pidgins, and creoles) operate in the 'trading zones' of science (Galison, 2010), and beyond to Leonelli's definition of data as 'portable objects' able to provide evidence (Leonelli, 2015, p. 817). In spite of this large base of literature on how communication can travel across epistemic cultures, however, the experience of large-scale digital humanities projects can also be that it either does not, or certainly does not do so smoothly or easily (indeed it was such an experience in a project that initially gave rise to many of the ideas that would later frame the KPLEX project). In particular, the interviews with computer scientists showed a reluctance to discuss what certain key terms might mean or imply, a lack of precision that would surely draw criticism in a purely humanities context. This impulse weakens the potential for self-reflection in computer science but also greatly impedes successful interdisciplinary work, which may progress for extended periods on a falsely constructed sense of common understanding. While this obscurity had already been observed by Borgman (2015), the KPLEX project results provide not only empirical evidence of the extent of this phenomenon, but also of its eventual negative consequences.
One very marked example of this can be found in the word 'data' itself. This is a word that has not only achieved an almost fetishized status in popular discourse (with data having been deemed to be 'the new oil') but also raises well-known and documented resistance among humanists. Akers and Doty (2013, p. 16) noted a comparatively high degree of uncertainty among humanists around the status of their sources and products as data, a finding largely confirmed by Mohr et al. (2015) and Thoegersen (2018), though with some further optimism in the lattermost of these. An interesting and rich confirmation of this can be found in the 2018 Twitter thread launched by Miriam Posner, asking humanists whether they would view their sources as data. The answers were as interesting as they were varied, but they were also, by and large, negatively valenced toward this particular term, characterizing it as purely numeric, but also derived, impoverished, simple, declarative, and indeed even suspect (Posner, 2018). Most striking among the responses was perhaps one that seemed to address much of the 'anxiety' that seemed to lurk under the surface of the thread in the summation that 'Calling a source data means that person doesn't value it or respect its integrity'. Although they are less sceptical and more expansive in terms of how data are defined, it is interesting to note that a similar heterogeneity exists in how researchers in the field of STS define the word 'data' (Nugent Folan and Edmond, 2020). What the KPLEX data show is the mirror image of this phenomenon of how humanists view the concept and semantic field covered by the word 'data': if humanists resist the term data because they find it impoverished, computer scientists seem to define it in a way that gives it almost unlimited scope.
The word data was defined by the KPLEX interviewees in this group as 'text' (WP2-I1), 'stored information that I can manipulate, search, query, get some statistics about' (WP2-I2), 'anything that I am analysing, or using to train a system' (WP2-I4), 'any material that you have in hand . . . like digital material'. (WP2-I3), 'everything that I can use to study a certain subject' (WP2-I5), 'information that could be quantified. . . . that you would use'. (WP2-I7), or 'any piece of information that can be . . . recorded in an index', 'just evidence' (WP2-I9), 'facts, collected through facts. . . numerical facts that given in periodical time . . . also in a sentence or alphabetical'. (WP2-I12), or 'any piece of information, literally anything, but if you're looking for a computer science point of view, any structured bit of information is data'. (WP2-I8) This tendency to take an exceptionally broad view of data was perhaps best captured in the statement 'data exists, it does exist, it just exists in and of itself'. (WP2-I6) The clear trend running throughout these examples points toward an epistemic cultural bias within computer science utterly different from that shared by Posner's humanists, one that views data as broadly encompassing and in terms of its function or utility in the research project, rather than as a complex set of information objects that come with biases built into them and which might merit significant meta-reflection. The exact meaning of the word 'data' at any given moment for any given researcher is shrouded in the layers of context that shape the discursive interactions of computer science. As a side note, it is also interesting to observe that almost all of the respondents used the word 'data' in the singular, a habit of thought that seems to underline this apparent bias within their shared epistemic culture against reflecting on the nature, status, and role of the term data in their work.
Perhaps all the more interesting in this context, then, were those interviewees for whom the request to define this key term for their work (and there is no question that the term is central, appearing between 50 and 220 times in each 60-75 min interview) was met with some discomfort or resistance. Two interviewees began their responses with the very honest disclaimers that they either didn't 'have a perfect text definition of data' (WP2-I8) or that it was 'not clearly defined' for them (WP2-I13), while another claimed not to use the term to describe any aspect of his work (WP2-I10). Perhaps most striking in this respect was the gap in the confidence demonstrated by these researchers between their ability to work with certain kinds of material and to talk about it, as, for example, in the following response: 'I don't think like my opinion is that important. I try to explain what I know. I think of data as. . .I just mean that I don't have maybe enough knowledge in the area. I know some things, but there are definitely like way smarter people but I try to give you what I have'. (WP2 I-3) It is perhaps a humanistic bias to expect expertise to include a precision in language around key terminology, and a few respondents did offer alternative words that they would use instead of data in certain situations, such as 'content' (WP2-I9), or a 'corpus'. (WP2-I13) The gap in the confidence of these researchers between their ability to work with certain kinds of material and to command the meta-language to speak about it was, however, striking, and raised significant questions about the function of linguistic boundary objects in the scientific negotiations of meaning occurring by and through data, transfers of the sort that have become commonplace, if not uncomplex, in the digital humanities. A further tension surrounded the stripping of complexity within datasets so as to enhance the ease with which they could be modelled and interrogated, and the errors and/or lack of traceability this stripping of context inevitably entailed.

Content May Be King, but Context Is Its Crown
While it may seem obvious to some that, to paraphrase Pelle Snickars, 'if content is king, context is its crown' (Snickars, 2012), this value is nonetheless either hardly recognized, or viewed to be under existential threat, in the communities represented in the KPLEX interviews. The datafication (generally understood as the rendering of original-state objects in digital, quantified, or otherwise more structured streams of information) of complex phenomena inevitably implies decontextualization. This loss was viewed with more or less concern depending largely on the disciplinary and methodological positionality of an interviewee, that is, on how they valued a gain in comparability and processability in relation to the reduction of real-world phenomena to (most often) observable and countable (discrete or continuous) variables. Amongst the researchers we interviewed within the KPLEX project, interviewees working at the nexus of such decisions identified a number of distinct risks inherent in decontextualization. First of all, the structuring, collection, and analysis of data are highly dependent on the epistemology and the methods applied by the person or process applying the structure and, as Bowker and Star (1999, p. 5) observed, the application of a structuring device represents an ethical choice, in spite of the fact that it is not always presented in these communities as such. While anthropology, a discipline which is uniquely sensitive to the positionality of the researcher in the field, may present an extreme case of the interrelatedness between epistemological decisions and datafication, the following quote from the KPLEX interviews representing this disciplinary perspective illuminates the dependency of datafication on the positionality of the researcher: 'So, from an anthropological point of view/We had all these "writing culture" discussions on subjectivity. And that we have always to see the data we produce in relation to the researcher.
And his or her standpoint in the landscape. Or standpoint in the situation, in the field. And these discussions about data storage imply that data is independent from the researcher. And from an anthropological point of view, I would say this is a step back'. (WP4-I15) In contrast, the mutability of conceptualizations of data discussed in the previous section belies a very different stance toward data, which seems able to change its status (and therefore its context) according to the user's needs of it. Rich metadata documenting the epistemological approach that was chosen in the datafication process, the methods used, and how and where the dataset was constructed can guard against too much loss, but doing so creates a second area of risk, namely that the metadata designed to aid comparability and findability become more complex than the original object. The data/context trade-off has consequences for the interpretation of these data, as context that exists outside of the datafied records cannot influence the results of search queries or interrogation processes in a computational setting. Such information cannot be wholly neglected, however, especially in the social sciences and the humanities, where differing historical, economic, social, and cultural contexts and the provenance of the data have to be taken into account as a precondition for any knowledge claim. Loss of context is often seen as a necessary precursor to computational analysis, but when it comes to artificial intelligence, where neural nets are fed with big data, lack of context can imply severe constraints. This had already been noted in 1972: 'Artificial intelligence must begin at the level of objectivity and rationality where the facts have already been produced. . . . But these facts taken out of context are an unwieldy mass of neutral data' (Dreyfus, 1972, p. 193).
This line of critique has recently been taken up by sociologist Harry Collins, who insists that the unresolved challenge of AI is 'the need for computers to be embedded in social context in the same way that humans are embedded in social context' (Collins, 2018, p. 7).
The KPLEX interviews also remind us of how contextualization facilitates, almost paradoxically, a reflection on what is not contained in the data, which is necessary to avoid blind spots in their analysis and interpretation: 'And then, I am always kind of worried, that because of the fascination with the empirical approaches, because it looks so technical and so ingenious sometimes, we forget about the historical dimensions'. (WP4-I4) Computers are exceptional tools for identifying patterns across large amounts of data, and they are not subject to the cognitive biases characteristic of researchers, though they may carry the biases of those that designed them. The opportunity for (digital) humanists using them lies therefore in developing methodologies capable of resolving the data/context trade-off, realizing such measures as provenance tracking and the clear recognition of data limitations and gaps in actionable ways. Such processes could start from critically assessing the context of the creation of the dataset by posing questions such as 'Where did it come from? Who collected it? When? How was it collected? Why was it collected?' (Krause, 2017) and by applying digital source criticism as proposed by Fickers (2013). This would allow one of the epistemic peculiarities of the humanities, a nomadic alternation between study of the research material, contextualization, and interpretation, to deliver more responsible and richer approaches in data-driven environments. In order to maintain processing efficiency, these higher-level reflective layers could be considered as separate processing loops, just as they might be deployed by a historian or literary scholar, by which some parameters remain outside of the system, and initial results are always subjected to further loops of meta-criticism (Have data become decontextualized?
Are results reliant on areas with or near known gaps in the digital records?, etc.). For a notable sketch of such a 'Computational Grounded Theory' in the social sciences see Nelson (2017).
Tools and Standards Are Pharmaka, Giving Much but Taking As Well

The KPLEX project also discovered strongly opposed attitudes toward the application of standardized approaches and models to knowledge organization and modelling, so much so that they came to be viewed like the pharmaka of reading and writing described in Plato's Phaedrus: knowledge technologies that can be used to help and heal or, in the wrong manner or dosage, to harm or kill. This is of course an issue long recognized as a sort of 'grand challenge' within library science (cf. e.g. Bowker and Star, 1999), in documentational (or archival) thinking, where indexing forms a cultural technique supporting scientific investigations by establishing typologies and classifications, thus positioning the subject within an epistemic regime (Day, 2014), as well as in many disciplines and non-research applications of models (cf. Morgan, 2012 regarding economics, or indeed O'Neil, 2016, if we want to look at public and corporate manifestations). If context is a threatened key to a nuanced understanding of data, the challenge of working within standardized environments is to provide for this within given metadata or other organizational standards and systems. But metadata standards are both liberating and limiting in equal measure. This becomes especially evident from the perspective of cultural heritage institutions, where shareable knowledge about digitized collections is often limited to what can be encoded into descriptions whose richness is mediated by a potentially limited or limiting standard.
While these collection descriptions open up the holdings of an archive to the users and make them optimally interoperable (internally and externally), they may also produce a reductive homogeneity that contradicts the variability and complexity humanists (and indeed, humans) may require, while also silencing other kinds of tacit or non-standardized explicit knowledge in or about the collection, or other ways of parsing the rich knowledge they represent. Hesitation around the application of standards arose in many places in the KPLEX interviews, for example in the generally cold reception among researchers of EmotionML, the standard for marking up emotional references, but was nowhere more pronounced than in the interviews conducted with professionals from cultural heritage institutions. The use of metadata standards was seen as helpful if they enabled practitioners to provide context about the structure in which each individual item is embedded. As one practitioner described the challenge: 'I think generally it's useful if in the way you sort of structure your information you can capture relationships between objects and even things that aren't objects . . . I see less value in just describing an item on its own and more value in trying to in the way collections [sic] are put together, made available, to sort of build some of those links or make them more visible'. (WP3-I10) Structure too needs to respect the context in which it operates, a challenge to the benefits of widespread standardization across collections and institutions. While the force of formalization enables interoperability, aggregation, and scaling, at the same time it may impair the iterative adaptation of parameters to which humanists are accustomed, as well as the free-floating discovery of meaningful relationships (Saklofske and the INKE Research Team, 2015).
In order to avoid a levelling of differences and to bypass reductive and limiting metadata structures, the experts establishing those data have to anticipate which possible research roads the users will take. And even as the standardization of formats and metadata decontextualizes and depoliticizes data, such formalization enables those who conduct complex queries or statistical analyses to work with such data without directly engaging with them or understanding how the data have been compiled and organized (Wilson, 2011, p. 867).
The problem of narrowing knowledge that these concerns point towards was also reflected in the wider concern that emerging knowledge creation norms might ossify into de facto processual standards that could hide biases, 'hidden' sources, and gaps in the imagination of descriptive systems. For instance, the traditional hierarchical structure of collections through which contextual connections could be traced was felt to be losing ground to 'Google lookalike' (WP3-I7) keyword searches: 'People are so adapted to the Google search that they don't even know anymore that there is a different way of searching. . . . but we feel that this is more connected to the way in which researchers today, especially the young generation, use search engines and tools that we developed in the past'. (WP3-I3) Google-style keyword searches thus may marginalize sources, limiting research results to content a computer could register, bypassing the often tacit, but essential, knowledge embodied in archivists. Opening and simplifying access to complexity to the extent of imitating the functionality of Google was also seen to present its own dangers of obfuscation: 'Sometimes they [people who use archives and archival descriptions] of course compare us to big players like Google or something like that. No archive can work like Google. We don't have the manpower or the finances of Google. But to present data in a platform like [X] has on one hand a chance that people are asking for your holdings. On the other side it's a big, big danger that they are only looking for that information and don't realize that we might have more'. (WP3-I1) The kind of openness promoted by Google and its like can hereby become a chimaera if users lose their awareness that what they see is a particular representation, perhaps two or more times removed from the original objects of interest: once by the process of digital documentation, once by its standardized registration.
Moreover, interviewees envisioned research methods continuing to become yet further distanced from the researcher's hand as automated tools, machine learning, and AI grow to play an increasing role in the near future, a position echoed in other preliminary work as well (Kim et al., 2016).
The tension between ease of processing and the need for an overview through the many layers of potential richness in any object of study, even a digital one, extends beyond the documentation challenge that datafication poses. Once systems are too big to verify in detail, we must accept that the richness of original objects may be lost, or, alternatively, find new methods by which to ensure that the uncertainties introduced by standardized approaches to data management remain a part of knowledge, rather than a threat to it. This challenge is not primarily one of computation: while even some humanists might propose that a Bayesian approach to managing uncertainty can be applied effectively in their work (Lavan, 2019), what is perhaps more pressing is to find ways to ensure that the richness of humanistic strategies to make the conditions for knowledge claims apparent within or alongside those claims (be they realized in paradata, commentary, comparison, referential narratives of provenance, etc.) can be adapted for use more widely.

Data without Theory Is as Problematic as Theories without Evidence
The researchers interviewed within the KPLEX project underscored wider objections against claims that big data delivered the 'end of theory' (Anderson, 2008). Although Anderson's now somewhat infamous argument may seem simple to refute, in particular from a DH perspective and in DH contexts, it actually resonates quite strongly with some of the responses of the computer scientists interviewed within KPLEX related to the threat of introducing confirmation bias into data. Although the questions in the interview related to narrative, rather than theory, concerns regarding how data might be misinterpreted are potentially highly transferable: 'Like, any sort of model or meaning or you know representation or any of these kinds of stories that we come up with, they're not an accurate representation of reality, for sure. And as much as possible, people try to use data to back it up, to show that their narrative, their representation of the world is correct' (WP2 I-6). This is also echoed in Breiman's (2001) 'two cultures' in statistical computing: one culture operates with stochastic models (i.e. simplified models of reality), while the other treats the data mechanism as unknown, and not necessarily needing to be known, as the goal is to optimize results within a system rather than address a context (such as a research question) from outside of it, a case in which the rejection of theory can again seem valid.
The empiricist standpoints carved out in the discussion triggered by Chris Anderson's article can be summarized in three fundamental claims. The first claim asserts that big data can capture the whole of a domain and provide full resolution. Secondly, it is argued that there is no need for a priori theory, models, or hypotheses. Thirdly, it is purported that through the application of agnostic data analytics the data can speak for themselves free of human bias and framing, and that patterns and relationships within big data are inherently meaningful. The first claim with regard to big data striving for exhaustivity was contradicted by KPLEX interviewees pointing to the specific interests driving big data research: 'within big data there is a lot of behavioural tracking. . . . what they're doing is . . . they answer the question: What do I have to do in order to make the most profit out of the website? . . . If their question would be: What does make my customers feel the best? . . . with the behavioural data you don't get any information about how they feel. You just get information of whether they stick on the website or whether they move away from it'. (WP4-I13) The undirected collection of big data for its own sake was openly criticized: 'I think the greater problem to be honest is that we have a lot of data that we cannot make sense of. I mean we collect tons and tons of data and most of that is basically unused because we don't have enough theory to interpret it, based on the theory. That for me is the biggest problem, the datafication itself'. (WP4-I13) While big data claims to strive for exhaustivity, it can generally be questioned whether it is able to capture the whole of a domain: 'all data provide oligoptic views of the world, not panoptic ones: views from certain vantage points, using particular tools, rather than an all-seeing, infallible God's eye view' (Kitchin, 2014, p. 133).
The second claim, declaring the obsolescence of a priori theory, stochastic data models, and hypotheses, is underpinned by the use of algorithmic models based on machine learning which treat the data mechanism as a black box and aim at strong predictive accuracy (Breiman, 2001). In contrast to this, human science-focussed researchers confessed that interdisciplinary collaborations often failed because of the lack of a shared theoretical framework: 'I think [a] lot of the big data projects have failed miserably. They don't find the things that they wanted to find because they didn't have a theory. So, you need something to start with'. (WP4-I11) Rather than relying on the exploration of correlations and patterns, the need for scientific discovery to be guided by previous findings and theories was accentuated: 'you probably remember very well Mary Douglas "Purity and Danger". . . . So that's . . . something that I used as the kind of theoretical opening to the idea that there's certain kind of patterning around the image of the nation. . . . this was very funny. Because I've talked about Mary Douglas's research to like big data researchers. And there's always somebody who completely gets it. And then there are others who completely don't get it'. (WP4-I11) Theory at least implicitly plays a role when it comes to analysing and interpreting information; even if big data is not collected according to a specific theory, the latter is still necessary to perform the shift from information to meaning: 'So it is one thing to establish significant correlations, and still another to make the leap from correlations to causal attributes' (Bollier, 2010, p. 16).
The third claim, that data are objective, neutral, and free of bias, was echoed by those computer scientists interviewed in the KPLEX project who insisted that data is 'just evidence' (WP2-I9), or 'facts, collected through facts . . . numerical facts that given in periodical time . . . also in a sentence or alphabetical'. (WP2-I12) Such an empiricist framing of data seems to be oblivious to the insight that 'the term fact can simultaneously mean what is fabricated and what is not fabricated' (Latour and Woolgar, 1979, p. 236). Trained researchers should, however, be well aware that data do not come into being naturally and that they might be biased for various reasons. As one of the social scientists interviewed within the project underlined, this need not necessarily be disadvantageous for research: 'there're always biases. And we have to acknowledge them. That sometimes human bias is a very good thing. You know, we come from a country with a lot of gender politics. That's mostly a bias. But we think that this is an important bias to take gender into account and sometimes favour women. Or sometimes favour minorities. So, these are like human biases that are needed for society building and advancement'. (WP4-I11) In contrast to the framing proposed by Anderson, the KPLEX interviewees from applied research domains underlined that epistemology has to be conceived of as a process driven by research questions and hypotheses, a process to which the establishment of data structures and data collection are inseparably linked. The challenge of dealing with big data triggered a certain discomfort among the researchers interviewed, since these data do not necessarily navigate research into intellectually well-framed directions: 'big data might open up a more, a greater space for research, but still you cannot be sure that the answers you might find or insights you might find will answer your original question. . . . 
it is no automatism that the more data, the more answers will come' (WP4-I2). Data-driven research can therefore seem to stand in opposition to the methodologically controlled navigation through terra incognita which characterizes scientific research processes, which proceed instead on the basis of 'hypotheses that are more widely tested, which in turn are used to build and refine a theory that explains them' (Kitchin, 2014, p. 135). For such reasons our interviewees pointed towards the need for theory in the interpretation of these data, and also for the identification and investigation of those parts of the data which matter for a specific research question: 'What is noise? . . . So, whether you consider something to be noise or not is depending on how you frame your problem. . . . I think that we are actually after having more noise because we might find out that there is information in that noise' (WP4-I2). Theory is thus seen to be indispensable to discriminate between relevant and irrelevant data within the vast collections available.
6 The Power Structures of Technology Inhibit Accommodation of Analogue or Hybrid Narratives
Although most of the individuals surveyed or interviewed for the KPLEX project recognized that it was unlikely all culturally important source material would ever be digitized, a certain engineering imperative to build systems for the digital data that already exists, regardless of what they might obscure from view, was still apparent and was viewed with some concern by the users and curators of cultural data. In what could be described in Foucauldian terms as an 'epistemic rupture', the shift from analogue finding aids to digital archival descriptions produces frictions with as yet uncertain consequences. These frictions are driving at least two phenomena. First, the relationship between collections management professionals and users is changing profoundly. This brings the threat of losing an overwhelming amount of knowledge that was never formally recorded as data or metadata but is stored as tacit knowledge in the human systems of cultural institutions, whose staff still assume that a direct and dialogic relationship with users will be maintained: 'And our descriptions here are very, very detailed. . . . But at the end, you know the whole thing. . . . even if there are some technical [reasons a user] can't find it, we can find it' (WP3-I1). In contrast, the cultural bias toward on-line, accessible information implies that such dialogic exchange with the collections experts may be superfluous. Furthermore, as a second major outcome, the power structures regarding access to knowledge and knowledge creation are changing. Archivists themselves noticed their own detachment from the knowledge they presided over: 'It does in the way that you are not working on item level anymore, you are trying to subtract the general meaning, the general line from a collection . . . For every new collection that comes in, you can't go in depth, reading every page in detail, you skim through and you seek the major subject. We're not as close to the items anymore, we're close to the collection as a whole'. (WP3-I3) While archivists may maintain a bird's eye view of the holdings of their archive, this alienation process implies at the same time a loss of their gatekeeping function and a loss of control over the historical record. With users becoming able to explore the entirety of collections online (or, at least, to feel they can), their terrain of discovery is enlarged, and the institution-based professional's scope for contextualization is reduced. The data become unmoored from the context in which they were created, and this 'unmooring enables the power/knowledge of the database to travel and be deployed by others' (Kitchin, 2014, p. 22). The result of such a mobilization of information has been analysed by Bruno Latour, who described how networked science 'builds extraordinarily long, complicated, mediated, indirect, sophisticated paths so as to reach the worlds' on the basis of 'immutable mobiles' (Latour, 2010, p. 111), a term he created for stable and transferable forms of knowledge that are portable across space and time.
If practitioners' knowledge of context, which is critical to interpreting how archival sources might be useful in relation to other research materials, is no longer activated in the dialogue with users, this may impoverish users' understanding of the potential uses of sources. Beyond that, archivists were well aware of the dark side of discoverability, whereby the digital material they expose becomes vulnerable to misuse: '. . . data-linking is one of the limitations we have to take into account, and it's one of the primary factors in terms of restricting data, because even if you've removed all the direct identifiers, maybe indirect information that could be used to identify them . . . and the identifiers that are used, even if they're a numeric ID, could be linked to an existing dataset. It could be linked to the personal data that people have stored elsewhere than they're supposed to. . . . artificial intelligence has the potential to draw new conclusions from a large amount of data, particularly unstructured data, which . . . until quite recent years have resisted the broader analysis . . . if automated tools are able to make links between those datasets and then . . . infer conclusions about the people, if it's identified, then there's a significant danger to them'. (WP3-I5) Although the potential that data might be intentionally or even unintentionally misused in this way, even by data professionals, was not a question we had foreseen including in the KPLEX interviews, the extreme mobility of the referent of the term 'data' among the computer science cohort implies that vigilance toward the source and contextual implications of data might not be foremost in the values of the discipline. 
Placing the curator's concern about artificial intelligence against this background illustrates the power inequality between data producers and data processing facilities, since 'only already powerful institutions-corporations, governments, and elite research universities-have the means to work with them at scale' (D'Ignazio and Klein, 2020, p. 41). In the KPLEX context, it was mainly the big tech companies that were seen as threats in terms of the potential of aggregated data to lead to de-anonymization and to sensitive data becoming public through data linking. Such actors not only have the resources to collect, store, aggregate, and analyse data; they also command the computational power and the skilled algorithm developers needed to process aggregated data. In this way, power inequalities are inscribed into facilities providing analytic techniques such as statistical computing or the application of artificial intelligence. Moreover, and in contrast to universities and governments, tech companies do not need to answer to either an internal ethics board or, ultimately, the taxpayer for the way they spend their money. The distinction between public and private data analysis laboratories is therefore connected to questions as to who the legitimate producer of knowledge is and what credibility can be accorded to which agency (Haraway, 1996, p. 432). Moreover, knowledge creation in private laboratories is not subject to approval by the expert consensus emerging in scientific communities (Oreskes, 2019, p. 57). The threat of de-anonymization therefore not only marks a zone of insecurity, in that it is not yet certain that current regulations and best practices preclude it; it also reflects the unequal power structures of technology and clarifies why cultural heritage institutions and archivists are being held back from their goals of sharing their collections, exacerbating the threat of narrowness of focus. 
This divide between archival and computational thinking toward these issues illustrates an inevitable consequence of the power shift in knowledge creation, which must be taken seriously and not as evidence of the 'backward-mindedness' of cultural heritage practitioners. In this way, it also points toward a relevant and perhaps distinct aspect of the embedded values system of digital humanities, bringing an ingrained awareness of the sensitivity and complexity of the cultural record and a strong interdependence with cultural heritage institutions to data-driven environments.

Conclusion
Each of the challenges discussed above will likely be recognized by anyone who has worked within the digital humanities for any period of time. That said, the tendency within research across all disciplines to focus on and reward the product of science, rather than the process, does not incentivize the explicit surfacing of these aspects of DH work. This same tendency also obscures the fact that interdisciplinary, collaborative work is inherently more, rather than less, challenging in terms of its processes than mono-disciplinary, individual scholarship. What does it mean, then, to work in a field uniquely able to negotiate some of the most essential conversations of contemporary research, and indeed of contemporary life? The conclusions of the KPLEX project would suggest that the following actions could not only strengthen the future of the digital humanities, but also leverage the unique experiences of digital humanists to strengthen all research operating at the cultural-technical, or any, epistemic divide, forming a practical basis for the further development of the emerging subfield of critical digital humanities.
(1) Digital humanities should aim to be at the forefront of documenting and sharing its processes, successes, and failures in negotiating knowledge creation between very different epistemic cultures. Historians of science have, by and large, ignored the arts and humanities over the years (Daston, 2016), and we should not let the same fate befall the digital humanities. Instead, we should use the powerful tools our composite areas of interest place in our hands, from open notebooks to ethnographic and other forms of meta-research, encompassing such pillars of DH as the critical assessment of data and their contexts, a focus on the construction of datasets as a pre-condition for their reuse, and a heightened awareness of the positionalities of the people and institutions behind the data, to ensure our field is recognized for its methodological, as well as domain, innovations. Only in this way can we truly and emphatically support our inherent interdisciplinarity, and share valuable processual enablers, such as validated development pathways for shared data formats and structuring approaches, and explicit management both of how misconceptions are surfaced and resolved and of the incommensurability of research questions and methodologies.
(2) Even as we adopt more and more sophisticated technology, we should never lose the values of the humanities from our work. The richness the humanities frame of reference brings to DH projects should not be taken for granted. It suffuses the questions we ask about how we use language, how culture influences and shapes all that we do and create, how the context and provenance of documents and data shape and change how they might or should be understood and reused, how the patchy progress of the digitization of cultural records threatens to rewrite history from the perspective of another set of victors, how conceptions of key concepts such as privacy and sharing may be differently determined in different communities, and how unintended consequences can be imagined. These are central questions, and far too little discussed.
(3) To take this a step further, DH (as opposed to, e.g., cultural informatics) also grants its practitioners a unique perspective from which to understand, expose, and defend the methods of the humanities, in all of their exploratory, iterative, critical, speculative, dialogic complexity, a wellspring of insight and inspiration we should not lose. At the same time, we need to ensure that the vibrancy of these approaches and how we apply them does not come to depend on taking a defensive stance about terminology such as 'data'. Minding the potential linguistic and cultural gaps is a process that must proceed in both directions.
(4) Finally, the rich but largely tacit knowledge held within the digital humanities should be more often applied beyond the sphere of our own research. Technology has taught us much about how to understand the humanities, but the related core skill set is also needed in technology development. 
With their focus on language, cultural production, and the arts as they have developed and been used as cornerstones of individual and collective identities over the millennia, the humanities are positioned to speak about the development of contemporary society in a unique, powerful, and underutilized way. Combined with an understanding of what digital technologies and tools mean and do, they are essential contributors to the future, with a role very different from what we already see in the basic and applied social science approaches of STS or indeed HCI.
The irony of the findings of the KPLEX project, from the digital humanities point of view, is that the aporias that seem to hinder progress in the kinds of processes we engage in are of course not visible within DH as aporias at all. Although these gaps and blind spots have a significant impact on conversations about and results of technology development, in particular but not exclusively as pertains to big data, when reaching across the boundaries of disciplinary norms and epistemic cultures becomes central rather than peripheral to progress, creative and productive compromises can be found. As the use of digital tools and methods becomes more commonplace as a method for research in the humanities, the assumption is that the term and the specificity of digital humanities will disappear, becoming subsumed back into the domains and disciplines that had previously expanded to accommodate them. Looking at the digital humanities from the perspective afforded by the KPLEX project may, however, give an indication of where and how the field might still find room for expansion and continued contribution, even as its core approaches become mainstream in a postdigital age. Such development will have the potential to leverage an even more exciting 'humanities turn in our understanding of the computational and the digital' (Hall, 2011, p. 2).

Funding
This work was supported by the European Commission's Horizon 2020 research programme, grant number 732340.