Text mining Mill: Computationally detecting influence in the writings of John Stuart Mill from library records


the resources and permissions to transcribe extant library registers, and on access to previously digitized sources. Related copyright and privacy restrictions mean our approach is most likely to succeed for other leading eighteenth- and nineteenth-century figures.

Introduction
How can we understand the relationship between the books an author consults in a library and those they write? How can computational methods be used to trace how one individual library has affected the work and public interventions of an author? Under what circumstances will this be feasible, possible, or practical? We report here on an international collaboration that aimed to explore these issues via the reading and writings of the British philosopher, economist, and politician John Stuart Mill (1806-73), focusing on his relationship with the London Library, an independent lending library in London, UK, of which Mill was an engaged member for 32 years. Detailed archival research of the London Library's lending and donation records, followed by the assembly of a digital library of both these texts and the publications Mill produced, enabled text mining and natural language processing (NLP) approaches to detect textual reuse and similarities between passages of writing in the texts, and further close reading to establish relationship, context, and meaning between works borrowed and works written.
Building on a closely documented analysis of the archival record, and related synthesis of the results (O'Neill, 2015, 2016, 2019), we demonstrate, by text mining the books John Stuart Mill borrowed from and donated to the London Library against his published outputs, that the collections of the London Library influenced his thought, transferred into his published oeuvre, and featured in his role as political commentator. Intense periods of reading around a common theme can be identified in authorial practice, and books Mill consulted from the London Library can be found referenced extensively in his publications, showing this institution's importance to his work and public life, which had been previously unrecognized (O'Neill, 2015, 2016, 2019). This article concentrates on the computational approach used to underpin these findings.
Our findings will be useful to others wishing to compare and contrast the content and product of libraries regularly consulted by authors: we show that this combined archival and digital approach is an effective and efficient means to interrogate the historical borrowing record of a leading intellectual figure. We also demonstrate that we have extended the remit of research that can be undertaken with authorial libraries, library issue registers, and borrowing records, fundamentally reconceiving library issue records as data that can be used at the start of a digital continuum. We demonstrate how the triangulation of borrowing record, growing access to digitized resources, and the use of computational tools developed for literary study forms a rich nexus for nineteenth-century author and bibliographic studies. We also show that this approach depends on interdisciplinary researchers working alongside computational methods, rather than on researchers depending entirely on automation. However, our approach has various dependencies: it is reliant upon the survival of historical and archival issue records, and on gaining the permissions and resources to consult them fully; it is dependent on full-text access to previously digitized resources (only a selection of which may be available due to copyright and other restrictions); and the applicability of this method may be limited to other leading nineteenth-century intellectuals, given ethical concerns and the additional complexities of modern privacy legislation.

Mill shaped nineteenth-century British and international political discourse with his extensive publication record, including Principles of Political Economy (1848), On Liberty (1859), Utilitarianism (1863), The Subjection of Women (1869), and Three Essays on Religion (1874), all of which, naturally, cite other sources. Understanding how the books he read fed into his own published output can help us follow the development of his thought and influences. Somerville College, Oxford, holds the best-known collection of Mill's books (1,674 volumes). 1 Less well known is Mill's membership and use of The London Library, 2 itself a surviving Victorian institution: a subscription lending library, operating in central London since 1841 (Harrison, 1907, Nowell-Smith, 1958, 1972, Baker, 1988, 1990, 1992, Lynn, 2006, McIntyre, 2006, O'Neill, 2015, 2016, 2019). Mill began his relationship with the London Library in 1840 as an expert subject adviser for the acquisition of books on Political Economy, Logic, and French Histories and Memoirs before the Library opened, becoming a founder life-member, and remaining in membership records until his death in 1873 (O'Neill, 2016, p. 257; 2019, p. 187). Mill consulted books in and donated books to the London Library over 32 years, but prior to O'Neill's transcription and subsequent analysis (2016, 2019), his loan record had never been fully understood. His name as an early book donor had been spotted (Baker, 1992) and a donation of twenty-four titles had been documented (O'Neill, 2019, p. 186), but the number and variety of Mill's book donations were unknown, as the books had been integrated into the London Library's general holdings. The scale of Mill's prolific use and enrichment of the Library's collections, and the influence that these holdings had on his own outputs, was not known until O'Neill's detailed archival work, combined with the computational analysis described here, determined both the extent and significance of Mill's loan record and book donations. The 430 books Mill is now known to have borrowed from, and the 165 titles he donated to, the London Library (O'Neill, 2016, 2019) form a substantial bibliographic backdrop to the work of a preeminent Victorian thinker and throw light on a singular but little researched Victorian institution.
This article describes how computational comparison was undertaken on a digital corpus of books compiled by O'Neill, following her transcription of the list of books Mill borrowed from the London Library recorded in the Library's Victorian Issue Registers. Our experimental research developed from online conversations with the Digital Humanities community on how best to approach our problem from a computational angle (Terras, 2013). The research question was: how could we computationally compare the texts of the books known to have been consulted by Mill against publications written by Mill, to triangulate an evidential relationship between the books Mill borrowed, donated, and wrote? Could bibliometric analysis demonstrate the importance of the London Library to his research and thinking? What methodological issues, opportunities, and limitations does this present for others contemplating comparing the known reading habits of an author to their published output? 3

Related previous research
As James Raven argues persuasively, 'there cannot and should not be one type of history of the book, but many types' (Raven, 2018, p. 141), and so, notwithstanding general agreement on basic techniques in Bibliography set out by Bowers (2005) and Gaskell (1995), and more recently by Tanselle (2009) and Werner (2009), it is always necessary to identify the type of bibliographic methods that are being used in this research. Much has been written about authors' libraries to reveal their authorial habits (Sealts, 1966 and Olsen-Smith et al., 2009-2019 on Melville; Reynolds, 1981 and 1986 on Hemingway; Capps, 1966 and Miller, 2012 on Dickinson), in particular, the links that can be drawn between the items read and those written through careful scholarly research (Sealts, 1982 and Faflik, 2018 on Melville; Tyler, 1995 and Jungman and Tabor, 2000 on Hemingway; Coleridge, 1980-2001, Jackson, 2005, and Leinwand, 2016 on Coleridge). However, to date, we have found minimal previous research that has attempted to use computational approaches to detect and analyse the influence of authors' libraries upon published outputs: an analysis of Melville's marginalia used digital text analysis to identify sources in his surviving copies of Homer, Shakespeare, and Milton (Ohge and Olsen-Smith, 2018, Ohge et al., 2018)
(1980, p. 3): early studies acknowledged the need for assistance from library staff with access to institutional records and expertise in unpicking them (see, e.g. Keynes, but also Harding, 1957 on Thoreau). Previous research on the London Library issue registers has depended on this supported and approved access, but required additional, intense close reading of often difficult to decipher and transcribe content 3 (Baker, 1981, Atkinson, 2013), as was the case in O'Neill's foundational work (2015, 2016, 2019), although earlier work did not rely on digital methods to extrapolate findings further. Gribben's call in 1986 for collaboration between historians, computer scientists, and librarians has been realized in many digital author library projects, including Melville's Marginalia Online, 4 The Freud Library (Davies and Fichnter, 2006), The Gladstone Reading Database, 5 and the multi-national and crowdsourced RED, The Reading Experience Database. 6 However, digital research upon authors' libraries 'is currently limited to the provision of either digital catalogs that make library metadata available . . . or of simple digital copies, which are offered in a viewer and/or as a PDF download' (Busch et al., 2019, although they tackle this by developing a prototype visualization of Theodor Fontane's Library, in particular to identify patterns in marginalia). A recently funded project, Books and Borrowing 1750-1830: An Analysis of Scottish Borrowers' Registers 7 (2020-23), aims to reveal hidden histories of book use, knowledge dissemination, and participation in literate culture, but is yet to report. The advances in this article move beyond digital catalogue or visualization, demonstrating the affordances of sequence alignment techniques to identify, at scale, textual matches between items borrowed and items written by an author.
The identification of similarities and relationships between passages in large collections of historical texts (including direct quotations, commonplace expressions, plagiarisms, and other forms of borrowings) is of great interest to a variety of humanities scholars, as it can advance our understanding of influence, writing habits, and ethical approaches in a writer's work, while 'placing it in a larger intellectual and cultural context' (Olsen et al., 2011). Relationships between texts are complex and often multi-faceted, ranging from directly attributed quotations to influences and allusions, and a key approach in humanistic study is tracing these relationships (Jardine and Grafton, 1990). Intertextuality is a rich area of technical and theoretical research development, requiring collaboration between computer science and the digital humanities to build upon and utilize the growing number of digitized texts available to researchers via mass digitization sourced from either commercial (Google Books, Microsoft, Gale Cengage), government (Library of Congress, Bibliothèque nationale de France), or non-profit (Internet Archive, HathiTrust, Project Gutenberg) providers (Olsen, 2009, Smith et al., 2019a). Machine-assisted reading can be used to identify intertextuality, particularly when 'faced with the intricacies of text recycling in historical and literary works, along with the frequently degraded status in which these texts are currently made available' (Olsen et al., 2011), although it has been argued that this is an 'undertheorized' practice in the Humanities (Underwood, 2014).
Here, we are concerned with Text Recycling, or local text reuse, which identifies small regions of similarity, ignoring large amounts of difference, predicated on the pairwise comparison of many documents to identify typically infrequent instances (Seo and Croft, 2008). 8 There are many NLP approaches that can be used to do so (see Graham, 2019 and Smith et al., 2019b for overviews). Sequence alignment, which 'divides the source and target strings into overlapping sets of consecutive words . . . called "shingles" or "ngrams"' (Graham, 2019, p. 122), is widely used in bioinformatics, and as the basis for many plagiarism detection algorithms (Lyon et al., 2001, Bourdaillet and Ganascia, 2006). However, it has also been used by humanities scholars to detect sources, influence, and allusion in historical texts: in Classical Latin poetry (Bamman and Crane, 2008), in the eighteenth-century Encyclopédie of Denis Diderot and Jean d'Alembert (Edelstein et al., 2013; Roe, 2018) Franzini (2016).
Sequence alignment algorithms vary in complexity and resulting computational tractability. An alternative approach, using n-gram matching, is presented in Ganascia et al. (2014). Smith et al. (2013) detect clusters of reused texts to analyse the culture of reprinting in newspapers in the USA before the American Civil War, refining the n-gram shingling approach to optimize effectiveness and efficiency by employing hashing for space-efficient indexing of repetition and local alignment techniques to find compact passages with the highest probability of matching. This approach was also used to trace the flow of policy ideas in legislation (Wilkerson et al., 2015; Funk and Mullen, 2018). Recent developments in this method have also included visualization of results to support interpretation (Abdul-Rahman et al., 2017). However, although computational detection of textual reuse is becoming an established method in humanistic study, we have uncovered no previous application of this approach to authors' libraries or borrowing records.

Method
Our research consisted of four distinct stages. Firstly, O'Neill compiled the list of Mill's borrowing record of books held within the London Library, which was foundational archival research on both loans and donations records (2015, 2016, 2019). Secondly, O'Neill compiled a digital corpus of books written by Mill, from extant online sources (O'Neill, 2016, 2019).
Thirdly, sequence alignment NLP approaches to align subsequences and then cluster common passages were used to identify commonalities in texts between the books Mill wrote and those he read. Fourthly, analysis of the results of text mining enabled understanding of the relationship between Mill's London Library borrowing record and his published output (O'Neill, 2016, 2019). We detail our approach, its successes, and its shortcomings, here.
This research was given ethical approval from University College London's (UCL) Department of Information Studies. Given the timescale of the author records in question, there are no concerns regarding the General Data Protection Regulation or the need to obtain permissions from the individuals involved.

Library record compilation
This research depended on the time-consuming, detailed archival work with the library's extant loan records (see Figure 1) undertaken by O'Neill, and the permission from the London Library to do so. The challenges of such archival work are presented in O'Neill (2016, p. 258; 2019, p. 190). Additionally, identifying books donated by Mill within the collection required forensic and extensive consultation of 34 years of internal Library administrative records, catalogues, and supplements (O'Neill, 2016, p. 260; 2019, p. 190). The extracted loans data presents a unique corpus of 430 books consulted by Mill, albeit for a finite period from the early part of his membership (1842-9 and 1856-7), given the extant London Library issue registers: he is therefore likely to have consulted far more over his membership. This may be a topic for future research with these methods: estimating the probability that Mill quotes from books within the London Library, using their cataloguing and accession records to compile a wider corpus, and detecting matches in his output. In addition, Mill's donations were marked by three significant deposits over three decades, totalling 165 titles (see O'Neill, 2016, pp. 269-276 or 2019, pp. 379-390 for a complete listing). The records of the books loaned and donated were transcribed and entered into an Excel spreadsheet in order to enable further analysis.
We also acknowledge the limitations that poor OCR of digitized texts can inject into this process, and that depending on the quality of previously digitized content can affect research outputs in unknown ways (Cordell, 2017). Given that Mill heavily revised certain monographs, we were dependent on the scholarly edition of Mill's Collected Works (Mill, ed. Robson 1963-91, henceforth referred to as CW), available in an accessible format in the Online Library of Liberty, 10 which facilitated and accelerated close reading of textual matching. Our choice of subject matter is only suitable

Text mining approaches
We benefited from two different text mining approaches previously designed for the detection of textual alignment, one computationally lightweight and the other requiring significantly more resources, in the hope that we would be able to identify matches. The intention was not to compare these tools per se; however, using two available systems that differ in approach and execution allows insight into where these tools may benefit others.
We first used TextPAIR 11 (Pairwise Alignment for Intertextual Relations), an open-source software package developed by Roe and colleagues as part of the ARTFL Project at the University of Chicago for text reuse discovery in digitized text collections, originally implemented in 2009, and rewritten in 2018. 12 TextPAIR is an implementation of a very general sequence alignment algorithm for humanities text analysis that supports one-against-many comparisons using a generalized Python module. Sequence alignment respects order in documents, and can align similar passages directly, dealing with variations in similar passages such as insertions, deletions, spelling, OCR errors, etc. TextPAIR identifies regions of similarity shared by strings using word or k-tuple heuristics in order to balance efficiency and completeness while identifying occurrences of the same word sequences shared between documents (see Roe, 2012 for further documentation on TextPAIR's approach, and use of TextPAIR in Olsen et al., 2011, Edelstein et al., 2013, Kokkinakis and Malm, 2015). The benefit of using TextPAIR is that the results are stored in individual files associated with each source document, sorted chronologically by year of document publication, where the start of the matched passage is highlighted. This makes for a quick way to see borrowings and quotations, and for a researcher to return to results and identify linked passages. Parameters can be adjusted to loosen or tighten the degree of similarity. In our case, searches were run by Roe on a dedicated server at the Oxford e-Research Centre running the associated PhiloLogic search and retrieval software also developed by ARTFL. Our source and target datasets were indexed and loaded into PhiloLogic before using TextPAIR to pre-process the texts into overlapping tri-grams, or the three-word shingles used for identifying similar passages between corpora. We settled on matching parameters of ten or more words occurring within a sliding window of matched ngrams to avoid over-fitting of many banal expressions, which output an appropriate number of results for a researcher to return to for analysis.
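To make the idea of tri-gram shingling concrete, the following is a minimal Python sketch. It is not TextPAIR's implementation: it only finds runs of exactly matching consecutive shingles and has no tolerance for insertions, deletions, or OCR errors. The function names and the tokenization by whitespace are assumptions made for illustration; the defaults of three-word shingles and a ten-word minimum mirror the parameters described above.

```python
from collections import defaultdict


def shingles(text, n=3):
    """Return the overlapping n-word shingles of a text, in order."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def shared_runs(source, target, n=3, min_words=10):
    """Find runs of consecutive identical shingles shared by two texts.

    Returns (source_word_offset, target_word_offset, length_in_words)
    for every run that covers at least `min_words` words.
    """
    src, tgt = shingles(source, n), shingles(target, n)
    index = defaultdict(list)  # shingle -> positions in the target
    for j, sh in enumerate(tgt):
        index[sh].append(j)

    runs, seen = [], set()
    for i, sh in enumerate(src):
        for j in index.get(sh, []):
            if (i, j) in seen:  # already part of a longer run
                continue
            k = 0
            while i + k < len(src) and j + k < len(tgt) and src[i + k] == tgt[j + k]:
                seen.add((i + k, j + k))
                k += 1
            length = k + n - 1  # k overlapping shingles span k + n - 1 words
            if length >= min_words:
                runs.append((i, j, length))
    return runs
```

Raising `min_words` tightens the match criterion and lowering it loosens it, which is the kind of trade-off between banal short matches and a manageable number of results described above.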
Using the same set of source and target texts, we then used Passim, 13 implemented by Smith in 2012 at Northeastern University and since continually improved (Smith, 2019), which uses probabilistic approaches to text-reuse analysis to successfully detect alignments between noisy OCR sources. The software performs an initial filtering stage using n-gram shingling and then implements the Smith-Waterman algorithm (Smith and Waterman, 1981) with an 'affine gap penalty', which encourages inserted/deleted passages to be more compact. For this corpus of books, we treated each page as an independent document and had Passim return a maximally aligned subsequence in each pair of pages. As with TextPAIR above, we pruned away aligned passages with fewer than ten words. (See Smith et al., 2019c for a full overview of the implementation.) Openly available under the Eclipse Public License, Passim has been successfully used in a variety of studies and projects (see Smith, 2013; Wilkerson et al., 2015; Vesanto et al., 2017; Smith, 2019). Passim detects pairs worth aligning where textual variation and OCR errors mean that more straightforward approaches are less robust, but it is therefore more computationally expensive in both memory and time than pure shingling n-gram methods (although the code can be parallelized via Apache Spark, either on a single machine or a cluster). The results are a set of aligned text passages highlighting matches between source and target texts, providing references to document name and page number, allowing the researcher to return to both to undertake close scrutiny. Running Passim on the corpus of books required less than 30 minutes on a fourteen-node cluster at Northeastern.
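The effect of an affine gap penalty can be illustrated with a small word-level Smith-Waterman scorer. This is a sketch for illustration only, not Passim's code: it returns just the best local alignment score rather than the aligned passage, and the scoring values (`match`, `mismatch`, `gap_open`, `gap_extend`) are arbitrary assumptions rather than Passim's settings.

```python
def local_align_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best Smith-Waterman local alignment score between token lists a and b.

    Gaps are affine: the first skipped token costs `gap_open`, each further
    token in the same gap costs `gap_extend` (Gotoh-style recurrences).
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols        # best scores for the previous row of a
    f_prev = [neg_inf] * cols  # previous-row scores ending in a gap that skips tokens of a
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # score ending in a gap that skips tokens of b
        for j in range(1, cols):
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            step = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: never let a score drop below zero.
            h_cur[j] = max(0, h_prev[j - 1] + step, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


reference = "the profits arising from fishing with nets".split()
ocr_error = "the profits arising frorn fishing with nets".split()
insertion = "the profits arising from deep sea fishing with nets".split()

print(local_align_score(reference, reference))  # 14: seven matching words
print(local_align_score(reference, ocr_error))  # 11: one mismatch tolerated
print(local_align_score(reference, insertion))  # 10: one two-word gap
```

With these illustrative values a single two-word insertion costs 4, whereas two separate one-word insertions would cost 6, which is the sense in which an affine penalty favours compact inserted or deleted passages.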
It is imperative to note that this research is not an attempt at distant reading (Underwood, 2017), or purely quantitative literary analysis making 'a false claim to absolute knowledge and objective truth' (Bode, 2012, p. 10). The results from our searches required extensive close reading and synthesis from the researcher, Helen O'Neill, in a process that combines 'digital and computational methods with traditional modes of literary analysis' (Rosen, 2011). The distant reading approaches used here are a 'supplement to traditional close reading practices', as an example of how 'the invaluable resource of digital archives and the utility of searchable databases can be most rewarding when deployed in concert with close reading, archival research skills, and careful argumentation' that 'attend to the complexity and contingency of historical phenomena' (Rosen, 2011). The computational analysis therefore pinpointed where close reading analysis should occur, allowing us to 'quantify without losing the disruptive detail and splitting significations to which we have learned to attend' (Rothberg, 2010, p. 343), as a 'productive way of integrating empirical data with the paradigm of humanities knowledge as a critical, analytic and speculative process of enquiry' (Bode, 2012, p. 8).
The scale and scope of the texts in question require computational approaches for the identification of potential matches to be feasible; however, the results from this process require in-depth human synthesis and analysis to understand trends and assign meaning.
Textual alignment methods have a bias, as mentioned above, towards high-precision, surface-level matches. Other research projects that apply text-reuse methods to literary influence, such as the Tesserae project at Buffalo (Coffee et al., 2012) or the eTRAP project at Göttingen (Franzini, 2016), have focused significant effort on improving recall, e.g. by parameterizing textual variation with synonym dictionaries and part-of-speech substitution rules. However, alignment methods are useful as null models of textual influence. Since such models treat each mutation of a text in passing from source to destination as equally likely, we can establish a baseline for future investigations that account in a more nuanced way for authors' transmutation of their sources. In a similar way, null models of gene drift establish a baseline against which certain genes may be deemed adaptive. Textual alignment focuses our attention on the most likely matches. While we can evaluate the (generally high) precision of these methods, it is impractical to perform an exhaustive evaluation of recall. The more one relaxes the matching parameters (to attempt to capture allusion, or 'indefinite or diffused source' (Altick, 1975, p. 94), for example), the more noise is introduced into the system, which can often overwhelm the signal of re-uses.
For this project, the use of metadata in triangulating with ownership and borrowing records allows us to check model output against independent observations to some extent, although we are well aware that subtle allusions, unconscious borrowings, and lapses of memory may pass unrecorded (and may be better served by alternative methods such as topic modelling or stylometric analysis). This uncertainty at the level of individual instances of text reuse, however, can be mitigated by aggregating our analysis at the level of books, which also happens to be the level of our bibliographic analysis. While books whose only contribution to Mill's writing was indirect may thus escape detection, we are more confident in finding the books that made some direct textual contribution. State-of-the-art language models exhibit a growing sensitivity to a range of genres and long-distance dependencies among lexical choices. While these capabilities have to date been fine-tuned on fairly shallow paraphrasing, translation, and question-answering tasks, we expect that future directions in text-reuse research will focus on systems that combine search in dense contextual embedding spaces with models of text mutation trained on collections of documents and their sources.
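Aggregating at the level of books, as described above, can be sketched in a few lines of Python. The record layout (source book, target work, passage) and the identifiers in the usage example are hypothetical placeholders, not the project's data format or results.

```python
from collections import Counter


def matches_per_book(matches):
    """Count aligned passages per source book.

    `matches` is an iterable of (source_book, target_work, passage)
    tuples; this layout is an assumption made for illustration.
    """
    return Counter(source for source, _target, _passage in matches)


def books_with_direct_reuse(matches, min_matches=1):
    """Source books supported by at least `min_matches` aligned passages."""
    counts = matches_per_book(matches)
    return sorted(book for book, n in counts.items() if n >= min_matches)


# Placeholder records only.
example = [
    ("loan_A", "work_1", "passage one"),
    ("loan_A", "work_2", "passage two"),
    ("loan_B", "work_1", "passage three"),
]
print(books_with_direct_reuse(example, min_matches=2))  # ['loan_A']
```

Requiring more than one aligned passage per book is one simple way to reduce reliance on any single, possibly spurious, match, at the cost of missing books quoted only once.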

Understanding the borrowing record
The compilation of Mill's borrowing record is fascinating in itself as a snapshot of the zeitgeist of his age. From the works of leading European economists, philosophers, and historians, to children's books, it reveals Mill's lifelong interest in and affection for all things French; his active engagement with European culture; his attentiveness to women's writing, actions and opinions; and his focus on the economic, political, social, and cultural developments in countries, colonies, and continents across the globe. For a complete analysis of Mill's loans and donations by title and theme, see O'Neill (2015, 2016, 2019).

Results from text mining
Of interest here is the textual matching between loan and publication, where text mining has highlighted important further influence on Mill's thinking that may have gone undetected by keyword searching. This allows us to establish a direct link between books Mill borrowed from the Library, his published oeuvre, his political interventions, and his public profile, which would not have been possible without using computational methods due to the scale and extent of the task.
Seven hundred and ninety-five text matches of strings at least ten words long were found between the 'source' (Loans, Donations) files and the 'target' (Publications) via the TextPAIR approach, and 1863 were found via the Passim system. The difference in these numbers can be explained by the difference in tolerances between each system's algorithmic approach to matching, and correction for errors in OCR. There were many false positives, or more accurately, matches of texts within the digital files that are not necessarily useful to our cause. Some of these are due to the nature, form and content of the digitized texts, and artefacts from the digitization process the systems were comparing, such as 'the borrower will be charged an overdue fee if this book is not returned to the library on or before the last date stamped below', or 'please do not move cards or slips from this pocket', which made it into the OCR-generated text! Some of these matches reflected use of common aphorisms or phrases: 'an eye for an eye and a tooth for a tooth' was found in nine matches between source and target, and 'either to the right or to the left and that' was found in eight. The overestimate of the significance of such common phrases results, in part, from the focused input collection. A larger corpus would help both models of text reuse better infer the relative frequency of these phrases or, as in the case of the biblical quotation above, explain away their co-occurrence in two books as arising from a common third book.
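One simple way to surface such commonplace or boilerplate matches for review is to flag any matched passage that recurs across several distinct source books. The following Python sketch illustrates that heuristic; it is not a procedure used in this study, and the record layout, the normalization rule, and the `max_sources` threshold are all assumptions.

```python
import re
from collections import defaultdict


def normalise(passage):
    """Lower-case a passage and keep only runs of letters, joined by spaces."""
    return " ".join(re.findall(r"[a-z]+", passage.lower()))


def flag_commonplace(matches, max_sources=3):
    """Split (source_book, passage) pairs into (kept, flagged).

    A pair is flagged when the same normalised passage is matched in more
    than `max_sources` distinct source books, which suggests a proverb,
    a shared quotation, or digitization boilerplate rather than reuse
    specific to one book.
    """
    sources = defaultdict(set)
    for source, passage in matches:
        sources[normalise(passage)].add(source)

    kept, flagged = [], []
    for source, passage in matches:
        is_common = len(sources[normalise(passage)]) > max_sources
        (flagged if is_common else kept).append((source, passage))
    return kept, flagged
```

Flagged passages would still need a human decision: a phrase repeated across many books may be banal, or it may be a quotation from a common third source that is itself of interest.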
Both systems identified the same 500 substantive matches within source and target texts, often between multiple editions of Mill's works. These became of immediate relevance for comparison with the CW to establish the relationship between the published item and source, and to determine what these matches told us about Mill. Smith's Passim system identified significantly more potential matches where the OCR transcriptions were poor, which then required returning to the source texts and manually checking details for corroboration. The outputs of the text mining, therefore, provide a starting point for the detailed analysis, which requires much human synthesis to rationalize and establish relevance and conclusion, rather than a fully automated process.
A major result of the text mining and close reading analysis was that Mill clearly cited his sources: we do not identify any significant uncited or newly discovered influence. However, this computational approach (as postulated by Olsen et al., 2011) significantly improved 'the manner in which these relationships are linked from text to text. Rather than parsing a reference and link using citation data or outside references schemes, which can be highly variable, inconsistent, and typically keyed on page number of other rather arbitrary attributes', we identified and contextualized links and relationships in an efficient manner (Olsen et al., 2011). Doing this manually would have been possible given Mill's use of citation for these specific sources, but would have been a life's work. Our major finding is to show how the results of sequence alignment can indicate important influence of particular individuals, particular texts, references from particular genres of text (such as French literature), and around specific historical events (such as the Great Famine in Ireland).
Our results give an indication of importance of influence, first by the number of times specific authors or their texts are referred to by Mill. The speeches of the statesman and philosopher Edmund Burke, who was staunchly opposed to the 1789 French Revolution, were borrowed by Mill in May 1848 (Burke 1826-7): there are forty-six references to Burke in Mill's publications, showing his importance to Mill's argumentation. During the 1865 election in which Mill stood for Westminster, he quoted Burke on the hustings (Mill CW, Vol. XXVIII, 45) and recognized in him the significance of principled action: What was it which made Edmund Burke, with all his errors . . . soar so immeasurably above the vulgar orators, and still more vulgar statesmen of his day? What, except that he was a man of general principles? (Mill CW, Vol. IV, 115).
Mill's respect for Burke is evidenced through the repeated quotation throughout his publications.
Secondly, the text mining allows us to identify parallels between significant books read and significant outputs. The text mining results from both systems generated key textual matches which (Mill CW, Vol. II, p. 136). In discussing the association between labourers and capitalists, Mill used payments to crews of whaling ships as an example, further synthesizing Babbage's arguments to add to his own: Mr. Babbage, who also gives an account of this system, observes that the payment to the crews of whaling ships is governed by a similar principle; and that 'the profits arising from fishing with nets on the south coast of England are thus divided: one-half the produce belongs to the owner of the boat and net; the other half is divided in equal portions between the persons using it, who are also bound to assist in repairing the net when required.' Babbage has the great merit of having pointed out the practicability, and the advantage, of extending the principle to manufacturing industry generally (Mill CW, Vol. III, 1013).
This long quotation from Babbage, which appeared in the first and second editions (1848, 1849), disappears from the third (1852), indicating that Mill continued to revise his reasoning. However, text mining reveals here that Babbage's Economy of Machinery and Manufactures, consulted by Mill from the collection of the London Library, operated as a crucial and central influence when Mill was writing PPE.
Thirdly, text mining allows us to reinforce the importance of Mill's international influences, such as French literature and history, subjects on which he was particularly knowledgeable. Mill's active consumption of the French novel, revealed by his borrowing record, shows a heightened engagement with political, philosophical, and economic thinking in the dominant cultural medium of the day, much read by the London intelligentsia (Atkinson, 2013). In his essay on the writing of Alfred de Vigny, Mill contrasted Sue's 'Literature of Despair' with de Vigny's 'touching and beautifully told stories, founded on fact' (Mill CW, Vol. I, 488). Three significant French historians appear in Mill's loans record: Jules Michelet, François Mignet, and the French-speaking Genevan Simonde de Sismondi. All three were extensively reviewed by Mill (Mill CW, Vol. II, 111n).
Strong parallels can therefore be drawn between Mill's consultation of the Topography collections in the holdings of the London Library, and his publications. Given the numbers of politicians, civil servants, academics, clerics, writers, and prime ministers who were Victorian members of the London Library, Mill's donations introduced ideas from other countries and on contentious issues onto the bookshelves of some of the most significant power brokers in Victorian London in a way which was undetectable because of the absence of Mill's personal book label or signature, perhaps avoiding his identification as a radical influence (O'Neill, 2016, p. 277).
Currently, work on the corpus continues, as a case study for advanced matching algorithms. It would be logical to extend this work by: incorporating the titles from Mill's private library held at Somerville College, Oxford into our source texts; returning to the list of required books and using a wider variety of online sources to see if these are now available in digitized format for inclusion in our target texts; extending beyond the mining of Mill's monographs to compare Mill's correspondence, speeches, and articles in the CW against his London Library loans and donations; and applying statistical models to see if the quotations present in Mill's writing temporally align with his borrowing record.
This approach should be feasible for other authors, provided that access to library issue registers and institutional archival records is possible, and that the resources are available to gather the required data upon which to build analysis. It was not our intention to compare available software for text alignment in this study, but we would recommend using TextPAIR to identify core matches between texts and deploying Passim when the OCR is known to be more problematic, or where the volume of texts to be searched would benefit from parallelization, given the differences in computational requirements between the two tools. Both TextPAIR and Passim provided very useful results for this study, and are effective and advantageous computational tools, available for others to employ.
However, there are a variety of generic limitations to this research, which is dependent on a body of existing digitized content, including Mill's oeuvre, and the texts he read (even though we could not get access to digital surrogates of all titles required). Not all authorial figures have their writings digitized so completely, and therefore this method could be most successfully applied to authors whose outputs have benefited from prior digitization, building upon known biases within the historical digital canon which may have consequences for our understanding of the past (Putnam, 2016; Hauswedell et al., 2020). Digitization of cultural heritage content remains incomplete and uneven (Nauta et al., 2017). It is difficult to understand how different our results would be with access to a full set of digitized texts, and we have to provide methodological explanation to continually grapple with incomplete corpora and representativeness (Bauer and Aarts, 2000). We are at the mercy of prior digitization activities, including quality control for generation of high enough quality OCR transcripts to allow even advanced NLP algorithms to identify potential matches, and little information is provided to researchers about the digitization process and how this may affect text-mining approaches (Cordell, 2017; Hauswedell et al., 2020). Researchers operating within this space should therefore do so in a critical manner to understand how the digitization process may be shaping their findings.
Furthermore, there is a legal component to the affordances of this methodology. Copyright remains a driving force of digitization practices and 'the nineteenth century is particularly well represented in digital archives, owing perhaps to its 'goldilocks' (or just right) conservation-copyright status' more than to specific academic rationales (Hauswedell et al., 2020). Influences on the writings of other Victorian figures may be successfully analysed using our method, but this is not the case for more recent authors, due to the '20th century black hole' in our digitized cultural heritage (Fallon and Uceda Gomez, 2015). Researchers are also unlikely to be able to access the borrowing records of modern writers without explicit consent, due to changes in privacy legislation and the resulting appropriate responses from the library sector (Bowers, 2006; Dowling, 2017; Bailey, 2018): it is unlikely that modern reading records will survive to enable this type of research. We therefore suggest that this method is applicable to the reading and writing of authors beyond Mill, but is most likely to succeed, or even only be possible, for other leading figures professionally active from the mid-eighteenth to early-twentieth centuries.

Conclusion
Text mining the books John Stuart Mill borrowed from and donated to the London Library against his published outputs has shown that the collections of the London Library influenced his thought, transferred into his published oeuvre, and featured in his role as political commentator and public moralist. This research has moved discourse about the impact of the London Library onto an evidential footing, and also provides a proven methodological basis from which to approach future case studies that mine the reading records of other nineteenth-century intellectual figures in order to detect and analyse influence in their published oeuvre. Identifying and showing these links benefited from interleaving computational matching (or 'distant reading') with detailed 'close reading' of both archival registers and authorial outputs.

Fig. 1 London Library Issue Book No. 3 showing Mill's intensive borrowing record during 1845, London Library Issue Book Number 3, p. 529. The horizontal lines indicate the return of individual books. The vertical lines indicate that all the books listed on the page have been returned. This is representative of the type of library issue record that required transcription and identification from Mill's loan record. Image reproduced with the kind permission of the London Library. © The London Library
Detecting influence in the writings of John Stuart Mill. Digital Scholarship in the Humanities, Vol. 36, No. 4, 2021
[...] ; and Antonini et al.
[...] , in detecting reuse of Homeric epics across 15 million words of Greek and 10 million words of Latin (Büchler et al., 2012). Coffee et al. examined allusions to Vergil's Aeneid in the first book of Lucan's Civil War (2012). Büchler et al. extract relationships between different English editions of the Holy Bible (2014). Franzini detected similarities in English translations of the Polish romantic epic Pan Tadeusz by
[...] 2012, p. 1), it was unfortunately not possible to locate previously digitized versions of all of the titles. About 255 of the 435 books Mill borrowed (59%), and 91 of the 165 books he donated (55%) were obtained in machine-processable format. A limitation to this research approach is the still patchy digitization landscape (Nauta et al.
[...] identified Mill's germinal work The Principles of Political Economy with some of their Applications to Social Philosophy (PPE) (1848) as a target work of significance (understandable given that Mill would have been reading extensively for this in the period covered by the transcribed loans records).
Between 1826 and 1849 Mill reviewed Mignet's French Revolution (1826); Scott's Life of Napoleon (1828); Alison's History of the French Revolution (1833); Carlyle's French Revolution (1837); Michelet's History of France (1844); Guizot's Essays and Lectures on History (1845); Duveyrier's Political Views of French Affairs (1846) and wrote impassioned essays on Armand Carrel (1837) and A Vindication of the French Revolution of 1848 (1849). Two French economists and one French-speaking Genevan appear in Mill's loans record, all of whom are directly quoted in The Principles of Political Economy with some of their Applications to Social Philosophy (PPE): Charles Dunoyer, M.H. Passy and Simonde de Sismondi. Passy's work Des Systèmes de Culture, et de leur influence sur L'Économie Sociale (Passy, 1846) is referred to fifteen times in relation to peasant and capitalist farming. Sismondi's Nouveaux Principes d'Économie Politique (1819) and Études sur L'Économie Politique (1837-8) are referred to over fifteen times and are also cited in Mill's articles on the condition of Ireland during the Great Famine. De la Liberté du Travail (Dunoyer, 1845) by Dunoyer is particularly praised by Mill in PPE: