Noisy medieval data, from digitized manuscript to stylometric analysis: Evaluating Paul Meyer’s hagiographic hypothesis

Abstract

Stylometric analysis of medieval vernacular texts is still a significant challenge: the importance of scribal variation, be it spelling or more substantial, as well as the variants and errors introduced in the tradition, complicate the task of the would-be stylometrist by inducing noise and perhaps even interferences.


Understanding the French hagiographic tradition
The history of the early French prose collections of saints' Lives is still an enigma. Indeed, at the beginning of the thirteenth century, legendiers (i.e. manuscripts containing collections of saints' Lives) were already constituted, and the preliminary steps are missing. Where those collections do not adopt the liturgical calendar, they are often built around thematic series (Perrot, 1992, pp. 11-15): apostles, martyrs, confessors, virgin saints; but the organization within those themes is not clear. One of the hypotheses about the composition of similarly structured Latin hagiographic collections is that they are compilations of pre-existing libelli, independent units about one saint or a series of saints (Philippart, 1977). Paul Meyer's work 1 on the composition of Old French prose legendiers led him to discover that some of these came from successive compilations (Fig. 1). Using their macrostructure, he tried to organize the French manuscripts into families on the basis of thematic similarities, proximity of the groups of Lives, and recurrent series in the manuscript tradition. The first three collections, named A, B, and C, are composed through successive additions: collection A is a collection of apostles' Lives, collection B adds a collection of martyrs' Lives, and collection C is the aggregation of collections A and B plus 22 new texts: confessors' Lives, virgin saints' Lives, one text about the Antichrist and another about Purgatory. New additions to the collection are not united by a thematic object and seem messier. Studying those compilations, Paul Meyer also had the intuition that the collections contain smaller pieces. He identified a few series using authorship when he could (e.g. Li Seint Confessor of Wauchier de Denain in collection C) and proposed the existence of primitive series based on the recurrent grouping of selected Lives in different manuscripts, for instance the series Saint Sixte, Saint Laurent, and Saint Hippolyte.
Because most of the French saints' Lives are anonymous, and because the collections were rearranged by multiple editors over time, it is extremely difficult to locate what could have been the primitive series, and Meyer could not go further. This serial composition of the Lives of saints is a datum also noted by other specialists of Latin hagiography such as Perrot (1992) and Philippart (1977), the latter of whom even points out that these hagiographic series must be studied in their entirety, in the same way as a literary work. However, despite these academic positions, to our knowledge there is no complete edition of a hagiographic series outside the context of a full manuscript edition, most probably because the identification of the series themselves is a matter of debate.
As such, the aim of our article is, first, to determine whether Paul Meyer's intuitions and hypotheses can be invalidated, nuanced, or completed. Secondly, we would like to discover whether other links between saints' Lives can reveal series by single anonymous authors and help reconstitute some of the hypothetical pre-existing libelli. To do so, we performed a stylometric analysis on a manuscript representative of collection C. Because there is no complete edition of any of the manuscripts holding collection C, we created a pipeline to acquire the text. After presenting the data acquisition pipeline, we explain how we approach, in our stylometric analysis, the problems inherent to both Old French variability and automatic text acquisition. Finally, we propose an evaluation of the results with regard to the traditional knowledge we have of the manuscript transmission.

Development and Evaluation of a Data Pipeline
For this work, the BnF fr. 412 manuscript, written in a single hand during the thirteenth century, seems to be a valid source for text acquisition. This manuscript is richly ornamented, and juxtaposes calendars, a hagiographic collection, and a bestiary. The copyist dates the manuscript and identifies the illuminator:
Icis livres ici finist / Bone aventure ait qui l'escrist / Henris ot non l'enlumineur / Dex le gardie de deshouneur / Si fu fais l'an MCCIIII XX et V (fol. 227v)

The handwriting is a regular Gothic textualis of rather large module. The letters are rather tightly packed and the words not very detached, so that it is sometimes difficult to discern cases of agglutination. It is, however, very easy to read and has very few abbreviations. The ink is black and of good quality, and has lightened in only a few places. The manuscript was most probably written in a short time, which limits changes in the writing style (the writer's hand does not 'age'). It is in pristine condition, and has been digitized by the Bibliothèque nationale de France (BnF) and made available on Gallica. 2 To be able to analyze the text, we had to build a pipeline that would, step by step, enrich the data with more information: from pictures to text, from raw text to a normalized version, from the normalized version to linguistically annotated data, so that multiple stylometric approaches could be combined and evaluated.

Line detection
Handwritten Text Recognition (HTR) has evolved much over the past years, with easy-to-use tools such as Transkribus and Kraken. We distinguish two steps of text acquisition: layout detection (and particularly line detection) and the actual text recognition.
As Garz et al. (2012) put it, 'segmenting page images into text lines is a crucial pre-processing step for automated reading of historical documents': unlike printed books from modern editions, parchments present various issues, from ink bleed-through (the capacity of a verso writing or picture to be seen on the recto) to inconsistent background color. On top of these traditional issues, costly manuscripts like the BnF fr. 412 accompany texts with illumination, including historiated lettrines, flourished initials and marginal ornamentation, as well as rubrics, thereby underscoring the discontinuity between texts (Fig. 2). Line detection was found to perform very poorly in Kraken compared with Transkribus, two well-known and performant HTR engines. Kraken in its 2.0.5 release contains a traditional line segmenter based on contrast, which cannot be trained on a specific layout, while Transkribus uses deep-learning models for the same task. 3 While it should be stressed that we cannot offer a methodical, re-applicable evaluation of this performance, we can definitely say that Kraken would often miss lines, create a lot of false positives in ornamentation, and, not often but enough to be seen, incorrectly sort the lines. On the other hand, Transkribus would rarely miss lines and rarely find text in illuminations (although it could happen), but sometimes had issues with the last lines of columns (Fig. 3). We therefore believe the Kraken output in subsequent results to be much noisier than the Transkribus one.

HTR
Text acquisition was evaluated using both Transkribus (HTR+) and Kraken. Two datasets have been created for this purpose: (1) The main dataset, the Pinche Dataset below, is the combination of 271 columns transcribed by A. Pinche, spanning from folio 103r to folio 170v (Pinche, in progress). It has the advantage of having only one transcriber and has been proofread in the context of an ongoing PhD thesis. It contains ninety-six characters (single spaces included), of which twenty-eight are found fewer than ten times, and in total makes up around 495,000 occurrences. However, it has the downside of being both consecutive and attributed to a single author (Wauchier de Denain). (2) The second dataset, the TNAH Dataset below, was mostly transcribed by non-specialists and, despite several attempts to unify it, still presents differences in how the text was transcribed. It contains 102 different characters (single spaces included), of which forty-six are found fewer than ten times, and in total makes up around 70,000 occurrences. Despite the differences in length, and the limited scope of any comparison over such a limited number of samples, the texts transcribed by the students seem to show a higher variability in the number of unique characters (Fig. 4). For example, for the CON letter, we found five variants in the latter: U+A76F (114 occurrences, regular con letter), U+0039 (twelve, regular nine), U+A770 (seven, modifier con), U+F1A6 (two, 'Latin Abbreviation Sign Spacing Base-line US'), and U+2079 (two, superscript nine).
We trained three models, each of them tested on the same subset of the Pinche Dataset. As expected, the training set from Pinche was more efficient (most probably due to its single expert transcriber). However, we found Kraken to be quite impacted by the recognition of spaces. As such, we trained a second Kraken model that would not try to recognize spaces. The Transkribus HTR+ model performed best on the Character Error Rate (Table 1).
Folios 1r-3v were excluded from text recognition, because they contain unrelated resources (mostly calendars).

Word segmentation
As can be seen in the resulting text (Fig. 5), spaces are one of the least stable features to be correctly recognized. While spacing in handwriting is rarely truly regular, Old French manuscripts are a prime example of it. 4 Indeed, a quick look at one column from f. 10r in Fig. 5 shows that spaces are sometimes really small, sometimes nonexistent. Moreover, there are no hyphenation marks in the manuscript, which requires us to detect and concatenate tokens running from one line to the next. As an indication, the Kraken model trained with spaces had 905 errors related to spaces, of which 810 were deletions and insertions: this represents a 1.82-point drop of performance in CER and an impressive 37.39% of the test set errors. An option is to treat spacing as a natural language processing task, where the image is not taken into account. Of course, the notion of words and grammar has evolved, and what most of the other tools of the pipeline expect are words perceived as such by modern and contemporary medievalists. Unfortunately, due to the extreme spelling variation of Old French, dictionary approaches do not perform well. In a previous study (Clérice, 2019), we have shown that they do not extend to new, unknown domains as well as deep learning models. In this context, we used Boudams, the tool developed for the aforementioned article. It removes all spaces before reinserting new ones. We used the Old French model built for that study, which had a 0.99 F-score on the in-domain test set (which contained resources from the Pinche Dataset and the TNAH Dataset) and a 0.945 F-score on an out-of-domain dataset. Of course, the resulting output is not expected to be perfect (Table 2), and in fact each step we pursue might introduce new errors, as the outputs were not manually transcribed or corrected (Tables 3-5). However, we did keep the output of each step for later stylometric analyses.
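The share of space-related errors can be estimated by aligning prediction and ground truth with Levenshtein edit operations and counting those that involve a space. The following is an illustrative sketch, not the evaluation code actually used:

```python
# Sketch: quantify how much of the error mass is due to spaces, by
# backtracing the Levenshtein alignment between prediction and gold and
# counting the operations that insert, delete, or substitute a space.

def edit_ops(pred: str, gold: str):
    """Return the list of (op, pred_char, gold_char) non-matching edits."""
    m, n = len(pred), len(gold)
    # dp[i][j] = edit distance between pred[:i] and gold[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (pred[i - 1] != gold[j - 1]):
            if pred[i - 1] != gold[j - 1]:
                ops.append(("sub", pred[i - 1], gold[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", pred[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, gold[j - 1]))
            j -= 1
    return ops

def space_error_share(pred: str, gold: str) -> float:
    """Share of edit operations that involve a space character."""
    ops = edit_ops(pred, gold)
    if not ops:
        return 0.0
    space_ops = [o for o in ops if " " in (o[1], o[2])]
    return len(space_ops) / len(ops)
```

Run on the full test set, this kind of tally is what yields figures such as the 37.39% reported above.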

Abbreviation resolution, normalization, and lemmatization
With word segmentation available, two other forms of the dataset were needed: one where each word would be normalized and have its abbreviations resolved, and a second where each word would be tagged with both its part of speech and its lemma. We treated normalization and abbreviation resolution as a lemmatization task, as both require understanding phenomena such as prefixes and suffixes and replacing them with a neutral value.
As such, we trained Pie (Manjavacas et al., 2019a,b) on a corpus of Old French transcriptions available in TEI. 5 The training set was composed of around 125,000 tokens (including punctuation), the evaluation set of 16,000, and the test set of 15,000, taken from both the Pinche Dataset and the Oriflamms project (Stutzmann et al., 2013). They contained abbreviation resolution, accentuation, and punctuation introduction (sen -> s'en). The results were promising, with 96.86% accuracy overall, 96.96% on ambiguous tokens (whose input can be normalized in different fashions), 91.42% on unknown output forms, and 90.72% on unknown source forms.
To improve statistical calculations based on occurrence counts, we applied lemmatization. Unlike modern English, Old French is defined both by its spelling variation (not only between regional scriptae but also inside them) and by its rich morphology. As such, the same word with different flexions can be written in different fashions.

Table 2. Example of results and ground truth before and after word segmentation
Transkribus: Entendre la glorieuse passion saĩt aun piere lapos stre de son. " mart _ yre qil rechut por nostre sign: si est la ue rite del escrip ture
Transkribus + Boudams: Entendre la glorieuse passions aĩt a un piere la posstre de son ." mart_ ye qil rechut por nostre sign: si est la uerite del escripture
Kraken (No Space) + Boudams: Entendre la glorieuse passions at piere la positre de son. manty Kre qil rechut por nostre sign: si est la uerite del escriptire
Correct: Entendre la glorieuse passion saït piere l apostre de son mart _ yre q il rechut por nostre sign$ si est la uerite de l escripture

Table 3. Output of abbreviation resolution and normalization on the Table 2 content, with ground truth
Transkribus Raw: entendre la glorieuse passion saint aun piere lapos stre de son. martyre q'il rechut por nostre signor: si est la ve rite del escrip ture
Transkribus + Boudams: entendre la glorieuse passions aint a un piere la posstre de son. martyre q'il rechut por nostre signor: si est la verité del scripture
Kraken (No Space) + Boudams: entendre la glorieuse passions art piere la positre de son. mantyere q'il rechut por nostre signor: si est la verité del escriptire
Correct: Entendre la glorieuse passion saint piere l apostre de son mart _ yre q il rechut por nostre signeur si est la uerite de l escripture
(In bold, the same word 'saint' in the ground truth and its different versions in the various generated datasets.)
In the Pinche Dataset, which represents 27.34% of the whole corpus to be lemmatized in Transkribus, 6 the verb avoir (to have) has fifty-seven different spellings, the pronoun il seventeen, the noun emperëor eight, the adverb tout (all) fourteen, and the adjective saint eleven: for example, 'compagnie' can be found written as compagnie, compaignie, compaignies, compaigniez, conpagnie, conpaignie, and conpaignies.
Pie is a lemmatizer specifically designed to deal with historical languages exhibiting such traits as those found in Old French. We trained a lemmatizer on a dataset of approximately 500,000 lemmatized tokens taken from the Chrestien corpus (Kunstmann, 2009), the Geste corpus (Camps, 2019), the Institutes (Olivier-Martin et al., 2018), the Lancelot (Ing, in progress) and the Wauchier (Pinche, in progress) datasets. 7 The overall model had 96.38% accuracy on the test corpus comprising 48,317 tokens, punctuation included.
The final result is a lemmatization and POS-tagging of each document. Error accumulation through successive postprocessing steps, and noise in the source HTR dataset, lead to a dataset of varying quality, although some parts of the document, if not most, are treated with satisfying results.
To evaluate the impact of all pipeline steps on lemma and POS 3-gram frequencies, in a case where the total number of words can differ, we evaluate the differences with the ground truth as follows:

D(A, B) = Σ_i |tf(A_i) − tf(B_i)| / Σ_i tf(B_i)

where tf(A_i) is the absolute term frequency of feature i in document A, the document to be evaluated, and tf(B_i) its frequency in document B, the ground truth. We also provide the ratio of lemmas or POS 3-grams which are present in A but not in B and vice versa (labeled as difference in Table 6). The difference of OCR against Gold is higher than its counterpart as a result of noise accumulation in the pipeline (it contains more tokens).
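Under one plausible reading of this evaluation (absolute term-frequency differences normalized by the ground-truth size), the two measures can be sketched as follows; `freq_difference` and `missing_ratio` are hypothetical names for illustration:

```python
from collections import Counter

def freq_difference(doc_a, doc_b):
    """Sum of absolute term-frequency differences between the evaluated
    document A and the ground truth B, normalized by the size of B.
    One plausible reading of the metric, not the exact evaluation code."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)
    feats = set(tf_a) | set(tf_b)
    return sum(abs(tf_a[f] - tf_b[f]) for f in feats) / sum(tf_b.values())

def missing_ratio(doc_a, doc_b):
    """Ratio of features present in A but absent from B ('difference')."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)
    only_a = set(tf_a) - set(tf_b)
    return len(only_a) / len(set(tf_a))
```

Both functions take documents as lists of features (lemmas or POS 3-grams), so the same code serves for either feature type.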
Given the previous results, we kept only the Transkribus HTR+ model output and its variations (through Boudams; through Pie for lemmatization and POS-tagging). Each figure states specifically which version of the Transkribus pipeline output it uses.

Stylometric Analysis
The stylometric analysis has to address several challenges, resulting both from the nature of the texts and from the data acquisition pipeline: the short length and anonymity of most texts; the noise in the authorial signal caused by successive errors or innovations in the tradition of the texts (variants), as well as the amount of spelling variation; and the noise (and potential biases) resulting from the data acquisition pipeline. Even though stylometric methods have been shown to be relatively resilient to a (simulated or observed) moderate amount of noise (Eder, 2013; Franzini et al., 2018), devising a stylometric set-up to partially eliminate or circumvent it is still likely to lead to more reliable results.

Unsupervised analysis of short anonymous texts
The texts from the manuscript are, on average, quite short, with a median value of 3,539 words, and extreme values of 298 and 18,971 (Fig. 6). Texts that are too short create a problem of reliability, as the observed frequencies may not accurately represent the actual probability of a given variable's appearance (Moisl, 2011). To limit this issue, we removed texts below 1,000 words, a relatively low limit when compared with existing benchmarks (Eder, 2015, 2017), but motivated by the necessity not to exclude too many texts. Given the short length of the texts and the sparsity caused by noise, we implement a procedure to select for analysis only those features that satisfy a criterion of statistical reliability. In this, we follow the procedure suggested by Moisl (2011), in the implementation already used by Cafiero and Camps (2019). To summarize it, features are only retained if they match the desired confidence level and margin of error even for the smallest text in the corpus. For each feature (e.g. the function word 'et'), the minimum text size n is calculated with

n = z² p̄ (1 − p̄) / e²

where p̄ is the mean probability of the feature in our corpus, z the confidence level, and e the margin of error. We take z = 1.645 to obtain a confidence margin of 90%, and e = 2σ, where σ is the feature's standard deviation. Beforehand, to correct for normality, we generate a mirror-variable (Moisl, 2011):

v'_ji = (max_v + min_v) − v_ji

where v_j is the vector of the feature j, max_v and min_v are the maximum and minimum values in v_j, and v_ji is the relative frequency of j in a sample i. This mirror-variable is concatenated with the original variable in order to compute n. If n is superior to the length of the smallest text in our corpus, we exclude the feature from further analysis.
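This selection procedure can be sketched as follows, assuming relative frequencies stored in a NumPy matrix (illustrative code with our own variable names, not the implementation actually used):

```python
# Sketch of the Moisl (2011) reliability filter: a feature is kept only if
# the minimum sample size n it requires, at the chosen confidence level,
# is no larger than the shortest text in the corpus.
import numpy as np

def select_reliable_features(freqs: np.ndarray, smallest_len: int, z: float = 1.645):
    """freqs: (texts x features) matrix of relative frequencies.
    Returns a boolean mask over features deemed reliable."""
    keep = np.zeros(freqs.shape[1], dtype=bool)
    for j in range(freqs.shape[1]):
        v = freqs[:, j]
        mirror = (v.max() + v.min()) - v       # mirror-variable (normality correction)
        both = np.concatenate([v, mirror])
        p = both.mean()                        # mean probability of the feature
        e = 2 * both.std()                     # margin of error e = 2 * sigma
        if e == 0:                             # constant feature: trivially reliable
            keep[j] = True
            continue
        n = (z ** 2) * p * (1 - p) / (e ** 2)  # minimum sample size
        keep[j] = n <= smallest_len
    return keep
```

A feature whose required n exceeds the length of the smallest text is dropped, exactly mirroring the exclusion rule described above.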
Because most of the texts of the manuscript are anonymous, we follow an unsupervised approach to their analysis (Camps and Cafiero, 2013; Cafiero and Camps, 2019), using agglomerative hierarchical clustering with Ward's criterion (Ward, 1963), chosen for its ability to form coherent clusters.
The metric and choices of normalization are also important parameters, to which much attention has been devoted (Evert et al., 2017; Jannidis et al., 2015).
Following the benchmark by Evert et al. (2017), we chose to use Manhattan distance with z-transformation (Burrows' Delta) and vector-length Euclidean normalization.
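This set-up can be sketched with NumPy and SciPy on toy data (the feature matrix below is random, for illustration only; this is not the analysis code itself):

```python
# Minimal sketch: per-feature z-transformation (Burrows' Delta), then
# vector-length (Euclidean) normalization, then Manhattan distances fed
# into Ward agglomerative clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def delta_distances(freqs: np.ndarray) -> np.ndarray:
    """freqs: (texts x features) relative frequencies; returns the
    condensed matrix of Manhattan distances between normalized vectors."""
    z = (freqs - freqs.mean(axis=0)) / freqs.std(axis=0)   # z-transformation
    z = z / np.linalg.norm(z, axis=1, keepdims=True)       # vector-length normalization
    return pdist(z, metric="cityblock")                    # Manhattan distance

freqs = np.random.default_rng(0).random((10, 50))  # 10 toy texts, 50 features
tree = linkage(delta_distances(freqs), method="ward")
```

The resulting `tree` is the linkage matrix from which dendrograms such as those in Fig. 7 can be drawn.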

Noise reduction and choice of features
In the form in which they have reached us, medieval texts are noisy with respect to the authorial signal. The perturbations of the authorial signal can be inherent to the data, as is the case with the successive errors and modifications made by generations of scribes in successive copies of the works. Such is the case for substantive variants, but it also affects the linguistic form of the texts themselves: stratified, they can contain spellings and other linguistic features originating from the dialect and regional scripta of any and all of the successive scribes, creating a very important and heterogeneous spelling variation. The choice of working with the texts of a single manuscript was already guided by the aim of limiting this kind of noise, but is not, in itself, sufficient. For this reason, further normalizations, such as abbreviation expansion and lemmatization, were included in the data acquisition pipeline. Yet, even though it achieves satisfying accuracy at each step, the pipeline itself, through the residual presence of errors, introduces noise as well. Moreover, as the training corpora for each algorithm were not selected by a perfectly random process, they introduce the risk of potential biases.
To handle these risks, we chose to retain raw as well as normalized data for the analyses, using three feature sets: (1) Character: n-grams from raw HTR data (baseline). (2) Functors: pseudo-affixes from expanded data, function words and POS n-grams. (3) Words: word forms from expanded data and lemmas.
The aim of feature set 1 is to avoid biases resulting from the pipeline, and for this reason to use the initial raw output of the Transkribus HTR model, excluding all further normalization steps. Previous research has shown that character n-grams can be a way to circumvent issues due to noisy OCR output, especially when compared with most frequent words (Eder, 2013). Following existing benchmarks (Stamatatos, 2013), we choose n = 3 for our character n-grams.
Because it fits our case closely, we consider this feature set to be our baseline, and complement it with two others.
Feature set 2 is built to capture functors, that is, grammatical morphemes (Kestemont, 2014), while circumventing the noise due to scribal variation of a paleographic and graphematic nature. Functors have long been, and often still are, considered the most effective features for authorship attribution, because they capture unconscious individual variation while being less dependent on generic or thematic context. In this feature set, we used expanded data to extract pseudo-affixes, that is, a specific kind of n-gram that has been shown, along with punctuation n-grams, to outperform others (Sapkota et al., 2015), perhaps because of its ability to capture grammatical morphemes. Since there is no authorial punctuation in our case, we extracted four kinds of pseudo-affix n-grams: 'prefix' and 'suffix' (the n first or last characters of words of at least n + 1 characters), as well as 'space-prefix' and 'space-suffix' (the interword space with the n − 1 characters following or preceding it), with n = 3. For instance, for 'annoncier', we extracted '^ann', 'ier$', '_an', and 'er_'. We also included function words, commonly recognized as one of the most effective features (if not the most) for authorship attribution (Argamon and Levitan, 2005; Koppel et al., 2009; Kestemont, 2014). Finally, we added information on the morpho-syntax of the texts by extracting part-of-speech 3-grams such as 'PRE DETdef NOMcom' (preposition, definite article and noun, e.g. 'a la corone'). POS 3-grams have sometimes been shown to be a quite effective feature for cross-topic authorship attribution (Gómez-Adorno et al., 2018). In this case, multiplying the measurements by concatenating three types of features in this set is meant to help deal with short noisy texts and improve reliability.
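The pseudo-affix extraction rules can be sketched as follows (a hypothetical re-implementation following the description above, not the exact code used):

```python
# Sketch of pseudo-affix extraction (after Sapkota et al., 2015), n = 3:
# prefix/suffix for words of at least n + 1 characters, plus the two
# space-anchored n-grams built from the interword space.
def pseudo_affixes(word: str, n: int = 3):
    feats = []
    if len(word) >= n + 1:                 # affixes only for long enough words
        feats.append("^" + word[:n])       # prefix, e.g. '^ann'
        feats.append(word[-n:] + "$")      # suffix, e.g. 'ier$'
    feats.append("_" + word[:n - 1])       # space + first n-1 chars, e.g. '_an'
    feats.append(word[-(n - 1):] + "_")    # last n-1 chars + space, e.g. 'er_'
    return feats
```

Applied to every token of the expanded text, this yields the affix portion of feature set 2; whether very short words receive space-anchored features is a design choice of this sketch.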
Feature set 3 is constituted because, despite the broad consensus on the use of functors, some recent studies seem to advocate the use of longer word lists as a feature for authorship attribution (Evert et al., 2017). Using word forms is, in our case, both interesting, because it allows us to retain morphological information, and risky, due to the extent of spelling variation attributable to the scribes. To account for that, we also include lemmatized words, which, in turn, are dependent upon the accuracy of the lemmatizer.

Results and cross-validation
The results on the three feature sets are included in Fig. 7, HC1. Our baseline result (Fig. 7, HC1, top) is also the one closest to Meyer's classification, often up to the ordering of the texts, though displaying a few differences (six out of fifty-nine texts, concerning mostly texts of B included with C). The results on feature sets 2 and 3, though keeping the same macrostructure, display some interesting variations with the inclusion of a mixed B/C subgroup within Meyer's A.
In order to gain more insight into feature sets 2 and 3, we also give supplementary results on their components (Fig. 8, HC2). This can be useful, since differences between clusterings based on separate aspects (e.g. morphosyntactic sequences versus function words or affixes) could reflect differences in groupings when alternative perspectives are taken on the language, or occasionally yield useful information on some texts as we vary the lens through which we observe them.
In order to check the robustness of our results, we give, for each analysis shown in Fig. 7, HC1 and Fig. 8, HC2, four indicators:
- the number of analyzed features and the agglomerative coefficient, which, taken together, give an indication of the quality of the clustering;
- the cluster purity of the groups (with k = 5) as compared with Meyer's hypothesis on A, B, and C, and Wauchier's alleged texts;
- the cluster purity of the groups when compared with our baseline (results on feature set 1).
These figures are given in Table 7.
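Cluster purity, as used here, credits each cluster with its majority reference class; it can be sketched as follows (illustrative code with hypothetical label lists):

```python
# Sketch of cluster purity: the share of texts that fall in their
# cluster's majority reference class (Meyer's groups, or the baseline).
from collections import Counter

def cluster_purity(clusters, reference):
    """clusters, reference: parallel lists of labels, one entry per text."""
    groups = {}
    for c, r in zip(clusters, reference):
        groups.setdefault(c, []).append(r)
    majority = sum(Counter(refs).most_common(1)[0][1] for refs in groups.values())
    return majority / len(reference)
```

A purity of 1.0 means every cluster is homogeneous with respect to the reference classification; values near 1/k indicate clusters unrelated to it.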

Classification resistant texts and volatility
Eder (2017) showed that, whatever the variation in sample length, some texts were never correctly attributed (at least with a given feature set) and suggested measuring the diversity of attributions of individual texts, what we call volatility, to help identify these cases when the authors are not known. Following this, one could hypothesize that the presence of classification-resistant (or volatile) texts is to be expected in a sufficiently large corpus.
To measure the volatility of any individual text in the context of unsupervised analysis, we wish to measure the stability or volatility of its neighborhood. We devise a specific metric V_i that aims to compute the volatility of the neighborhood of a specific text i across the groups of which it is a member in all the clusterings performed. Let (G_j), j ∈ J, i ∈ G_j, be the family of sets of which i is a member, where J is the total number of clusterings performed. We can then construct a set X containing all unique texts {x_a ... x_n} occurring in at least one set of the family (G_j): X = {x | ∃ j ∈ J, x ∈ G_j}. For each x, the family (G_j) can be split in two subfamilies, (A_k), k ∈ J, x ∈ A_k, and (B_l), l ∈ J, x ∉ B_l. We then compute a global volatility index as follows:

V_i = (1 / |X|) Σ_{x ∈ X} (|(A_k)| − |(B_l)|) / J

Since we normalize it by the total number of elements in X, this index is bounded by [−1, 1], where −1 would indicate perfect volatility (all sets with no member in common) and 1 perfect stability (all sets with the same members).
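Under our reading of this definition, the index can be sketched as follows (illustrative code; the partition format, a dict from text to cluster label, is an assumption of the sketch):

```python
# Sketch of the volatility index: for text i, collect the cluster of i in
# each of the J clusterings, then score every neighbour x by the number of
# clusterings grouping x with i minus the number separating them, averaged
# over J and over all neighbours X. +1 = perfectly stable neighborhood.
def volatility(i, clusterings):
    """clusterings: list of partitions, each a dict text -> cluster label."""
    J = len(clusterings)
    groups = [{t for t, c in part.items() if c == part[i] and t != i}
              for part in clusterings]
    X = set().union(*groups)               # all texts ever grouped with i
    if not X:
        return 1.0                         # i always alone: trivially stable
    score = 0.0
    for x in X:
        with_i = sum(1 for g in groups if x in g)
        score += (with_i - (J - with_i)) / J
    return score / len(X)
```

With all clusterings agreeing on i's group, the index is 1; the more the neighborhood changes from one clustering to the next, the closer it gets to −1.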
The results of this procedure are given in Table 8. We notice that the texts attributed to Wauchier are the least volatile, while a small group of volatile texts achieves a score below 0.5. Another indication yielded by this index is that volatility is not (or almost not) due to variation in sample length (Fig. 9). The small relationship, on the edge of significance, between text length and volatility that we observe when looking only at the supplementary analyses disappears entirely when we look at the reference analyses. This could be an indication that the strategy we adopted, of concatenating several measurements to increase reliability on short texts, is working.

Controlling for pipeline bias
To control for the presence of bias due to the training data used by the pipeline, we performed the same set of analyses on the data obtained with models trained on alternate data (the TNAH corpus) or with different tools (Kraken), and compared them with the analyses displayed above. The results are displayed in Table 9. Even though the models achieve quite different accuracies, the results are not significantly different and show a particular stability on the three main analyses (mean CP of 0.95 and 0.92, Table 9). The results based on a change of training corpus are actually closer than the ones obtained with the same corpus but a different HTR software, though the difference remains small.
An inspection of the features most correlated with the nine clusters of each of the three reference analyses (see Supplementary Material, Table A and Fig. A-C) shows a wide range of features of different natures, that is, variation at the graphematic, morphological, or syntactic level. It also shows that, in the case of presumably less grammatical and more thematic features such as words or lemmas, thematic interference can come into play. The variation in the use of these features can sometimes be attributed to diachronic or diatopic variation, while on other occasions they seem to be characteristic of a given idiolect, such as Wauchier's. For instance, the trigram 'que' ('that') may be the mark of a more common subordination in C and the evidence of a more recent syntax in these texts, while the trigram 'qil' ('that he') in Wauchier's texts is certainly the sign of a recurrent duplication of subordination in imbricated subordinate clauses, a feature of his writing style. The use of 'com' ('as') is also characteristic of Wauchier, while the use of the personal pronoun 'tu' ('you') shows a more contrasted situation. Syntactic sequences such as CONcoo ADVgen VERcjg display an evolution that could be perceived as chronological. Finally, we also have more problematic connections with thematic words, such as 'apostle' in collection A (see Appendices).

Interpretation of the Results
The manuscript fr. 412 allows us to control the results of our approaches by checking the unity of the Wauchier de Denain collection 8 in the stylometric trees. In the three reference analyses (Fig. 7, HC1), the Wauchier group, with the adjunction of the Life of Saint Lambert, is the most clearly distinguished group. This same configuration, with or without Lambert, is also visible in all supplementary analyses, except the ones based on POS 3-grams and lemmas (Fig. 8, HC2). These two analyses also achieve a low agglomerative coefficient, given their number of features, and low cluster purity, both in comparison with Meyer's classification and with our baseline; facts which advocate for considering them as outliers with low reliability.
Moreover, in their globality, the results seem to agree with Meyer's hypothesis, with CP from 0.83 to 0.9 for the reference analyses (0.71-0.85 for the others, Table 7). This is particularly obvious for our baseline (see Fig. 7, HC1, top, reproduced here as Fig. 10), which represents the manuscript fr. 412 as a successive addition of collections A, B, and C, appearing in separate branches. This can also be observed in the other trees, even if in a slightly noisier fashion. For this reason, Paul Meyer's hypothesis seems to be confirmed by our results. Nonetheless, it can be nuanced or made more accurate in a few cases, as we will see.

Volatile texts and exceptions to Meyer's classification

Catherine, Andrew (43-45), and Assumption-Antichristi (42 and 60)
A subgroup mixing texts from Meyer's B and C can be observed in our results (Fig. 7, HC1). It contains the Saint Catherine Life, the Saint Andrew Life and Miracles, and the Saint Andrew Passion (n. 43-45), as well as the Assumption of Our Lady and, though not always, the Antichristi (n. 42 and 60). 9 In the trees, this subgroup is sometimes included in C, sometimes in A. The Assumption and the Antichristi were identified by Paul Meyer as texts from collection B, and the second one was assumed to have certainly first been published in an autonomous way with some other texts: Saint Patrice's Purgatory and the Julian and Brendan Lives (Meyer, 1906, p. 405).

Clément and Patrice
Clément and Patrice are integrated into collection C (as opposed to Meyer's B) in almost all analyses. A precise interpretation of this remains to be found, but one can note that Paul Meyer supposed Patrice to have first circulated as part of a pre-existing libellus, autonomously from collection B (or C), together with the Antichristi.
However, given the very small number of texts erratically classified, and given the difficulties Paul Meyer already faced concerning some of these, we do not expect this to contradict our former conclusion. The apparent volatility of Marc and Jacques can partially be disregarded, as they appear only to switch subgroups while remaining inside A.

Collection A

Paul Meyer identified collection A as a group of twelve texts. It is most apparent in the baseline analysis (Fig. 7, HC1, top). From our results, it can be refined into two subseries, A1 and A2 (Table 10). Thematically, this regrouping makes sense. On one side, we find major apostles, founders of the Christian Church, with two sequences of three and

(Table 9. Cluster purity of the analyses replicated using models trained on the TNAH corpus or with Kraken, with regard to the analyses presented in Fig. 7, HC1.)

To conclude about collection A: in contrast to Meyer's hypothesis, Saint Longin's Life is almost never grouped with any of the aforementioned series, but is nearby, being clustered as the first element of B.

Collection B
Collection B contains nineteen texts, thematically centered on martyrdom. Of these, sixteen are found sequentially in MS BnF fr. 412; the three others are Our Lady, the Antichristi, and Saint Patrice. The latter are almost never grouped with the main body of B in the trees.
We can observe a strong group composed of twelve Lives: the Lives of Christophe, Agatha, Lucy, Agnes, Felicity, Christine, Cecile, Sixte, Laurent, Hippolyte, and Pantaleon are clearly gathered (Fig. 7, HC1). We can add to them the Lives of Georges, Vincent, and Sebastien, but Saint Clement's Life is always missing. Thus, globally, the sequence of the manuscript is reflected in the classification.
Moreover, in all the selected trees, the Life of Saint Longin, classified as A by Paul Meyer, is gathered with texts from collection B. Furthermore, thematically, Saint Longin's Life is not coherent with a series of saint apostles, given that he is a martyr. In manuscript BnF fr. 412, given the order of the compilation, we can consider Saint Longin's Life as the last text of collection A or as the first of collection B. Looking at the manuscript tradition, this Life is in fact mixed with the apostles' Lives only once, in manuscript BnF, nouv. acq. fr. 23686, which happens to be the one Paul Meyer used as his prime material for studying collection A. So, regarding our results and the manuscript tradition, it seems more accurate to classify this Life with the saint martyrs of collection B.
In light of this hypothetical reclassification, we can observe two subcollections (Table 11): a micro-series B1a composed of the Lives of Saint Longin, Saint Sebastien, Saint Vincent, and Saint Christopher.
In addition, within B, a micro-series B1b of saint women's Lives appears: Saint Agatha, Saint Lucy, Saint Agnes, Saint Christine, and Saint Cecile. These Lives are about virgins and are also close in the manuscript tradition: the first three texts of the series are often gathered together, as are the last three. There are also textual links between them. One explanation for the proximity between Saint Agatha and Saint Lucy may be that the latter seems to be a continuation of the former's story. 10 Indeed, at the end of her story, Lucy defines herself as an heir of Agatha: Aussi com la cites de Cathenense est secorue et aidie par seinte Agathe ma seror, aussi sera ceste citez aidie et socorue par moi, se uoz auez foi et creance en nostre Signor. 11 "Just as the city of Catania was rescued by the help of Saint Agatha, my sister, this city will be rescued by my help, if you have faith in our Lord."
We can add that both of them come from Sicily: Agatha from Catania and Lucy from Syracuse. There are also links between the Lives of Saint Lucy and Saint Agnes: the Life of Agnes starts in Rome, where the Life of Saint Lucy ended, and both saints face the threat of a spurned lover who wants to send them to a brothel. We can also note that Saint Christine, like Saint Agatha, is one of the four patrons of Palermo in Sicily, and that the theme of the snatched breast, iconic for Saint Agatha, can also be found in Saint Christine's Life. The reason behind the addition of Saint Cecile is more obscure: there might be a recurring theme around family and conversion. Finally, we can add that the Lives of this group are amongst the least volatile in our corpus after the Wauchier de Denain collection (Table 8).

(Table fragment: (4) Passion, translation and miracles of Saint James the Greater; (5) Saint Matthew Life; (6) Life of Saint Simon and Jude; (7) Saint Philip Life; (8) Life of Saint James the Minor; (9) Saint Bartholomew Life; (10) Saint Marc Life; (11) …. a This text is amongst the ones removed due to length below 1,000 words.)
However, both micro-series B1a and B1b can be grouped together as a collection B1 in five of the selected trees. This rapprochement seems logical from the point of view of literary construction, as it builds a collection with, on one side, five Lives of men martyrs and, on the other, six Lives of women martyrs. A stylometric study cannot determine the order of appearance.
Finally, Paul Meyer, during his work on the different hagiographic collections, observed that, within collection B, the series Sixte-Laurent-Hippolyte was frequent in the manuscript tradition (Meyer, 1906, p. 495). This grouping appears in our three analyses. Furthermore, the dendrogram based on function words (Fig. 8, HC2, top-right) links Laurent and Hippolyte with Pantaleon. This addition is not in contradiction with the tradition: collection G 12 contains them sequentially in three of its four witnesses, and in collection C (three manuscripts) Saint Pantaleon's Life is separated from the other ones only by Saint Lambert's Life. As such, it is possible that their gathering reflects a pre-existing series.

Collection C

Collection C contains twenty-two texts without any apparent major theme.
Collection C seems to have two major series, one constituted by Wauchier de Denain's Seint Confessor, and the other containing all the other texts (Table 12).
First, we can see that the Lives of Saint Maur and Placide and Saint Benoit's Translation 13 form a series C2. The translation and the Life of saint Maur are grouped in our three analyses, and the Life of Saint Placide is also close to or part of the group. Those Lives have a thematic unity: the translation of the body of Saint Benoit, followed by the Lives of his disciples, Saint Maur and Saint Placide.
Another series, C3, appears in all three analyses (Fig. 7, HC1): Saint Marguerite, Saint Pelagie, Saint Euphrasie, and probably also the Life of Saint Mary the Egyptian. Surprisingly, we can extend this series to a subseries C3b containing one, perhaps two, men's Lives: the Life of Saint Mamertin, always present as well, while the Life of Saint Simeon is only occasionally associated with this group, when considering function words or function lemmas alone.

Finally, this study has revealed an astonishing rapprochement between Li seint Confessor and the Life of Saint Lambert. Normally, Saint Lambert's Life is part of collection B, but, following our preceding analysis, it does not fit in any group of that collection. In fact, we have to look at some of the supplementary analyses to find results where it is not associated with Wauchier's works (POS 3-grams, function lemmas, and lemmas, Fig. 8, HC2), all potentially influenced by the nature of the training corpus. However, from a close reading perspective, it is difficult to affirm Wauchier's authorship, given that Saint Lambert's Life does not possess any of the usual distinctive marks of Wauchier de Denain's style, such as verses in prose, signatures, or vernacular translations of Latin citations. Moreover, there is no reference to Philippe of Namur, Jeanne of Flanders, or Roger, squire of Lille, Wauchier de Denain's patrons (Douchet, 2015). There is also no evident common Latin substrate between the Lives of Li Seint Confessor and Saint Lambert's Life. The only common points to be found are localization and theme: Liege is close to the Namur area. 14 Saint Lambert is an important saint, a bishop linked to power, in contact with Pepin, king of the Franks. The chosen version is the one with a positive representation of royal power. So we are in the presence of a bishop allied with power, like Saint Martin or Saint Martial.
On the other hand, one possible hypothesis in favour of Wauchier's authorship is that Saint Lambert's Life is an early text, in which the author effaces himself and remains in the role of a simple translator. Consequently, in the absence of any decisive evidence, further study of the relationship between Saint Lambert's Life and Wauchier's work is needed.

Conclusion and Further Research
The aim of our approach was to conduct an experiment that went beyond the HTR stage and to offer a complete pipeline to acquire and analyze medieval data, using a hybrid approach of human and artificial intelligence to answer a research problem, in this case the compilation mechanisms of the legendary C. Using machine learning, we were able to acquire and process textual data from medieval manuscript images, and then submit it to a stylometric analysis setup that seems able to deal with noisy data. The application of this procedure offers perspectives for ancient or medieval cultural heritage data, and could be extended to other material.
From a methodological and stylometric perspective, the challenge was, in a context where supervised analysis was not an option, to deal with short texts, 15 whose data were noisy in two regards: first, because of the noise generated during text recognition (HTR) and further processing steps; secondly, because of the noise affecting the authorial signal that is inherent to medieval data (spelling variation, variants, etc.). From our observations, our baseline involving character n-grams on raw HTR data (already suggested as the best adapted to noisy OCR or HTR data in previous studies) can still be considered a very efficient procedure. Our attempts to suppress noise due to spelling variation by using other types of features, such as lemmas or POS 3-grams, though offering alternative insights into the data, do not yet seem able to surpass it significantly, perhaps because of the cumulative error rates of each processing step. On the other hand, concerning the shortness of the texts, our results seem to agree with the notion that it is possible to analyze texts below 3,000 words; more specifically, by using less sparse features, such as character 3-grams, or by concatenating different features, different views on the same text, we seem to achieve stable results, independent of the variation of sample length in our corpus.
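The baseline described above, character 3-gram frequency profiles compared across texts, can be sketched in a few lines. This is a simplified illustration, not the paper's actual setup: the function names are ours, the toy strings are invented, and a real analysis would restrict features to the most frequent n-grams and typically apply a scaled distance (e.g. a Delta variant) before agglomerative clustering:

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Relative frequencies of character n-grams (the baseline feature type)."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def cosine_distance(p, q):
    """1 - cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm

# Toy illustration: a scribal variant of a sentence stays much closer to the
# original than an unrelated text, which is why character n-grams tolerate
# spelling variation and moderate HTR noise.
a = char_ngrams("seinte agathe ma seror")
b = char_ngrams("seinte agate ma seror")        # hypothetical spelling variant
c = char_ngrams("completely unrelated wording")
assert cosine_distance(a, b) < cosine_distance(a, c)
```

The resulting pairwise distance matrix is then fed to an agglomerative clustering routine (e.g. Ward linkage) to produce dendrograms like those in Fig. 7 and Fig. 8.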
From a thematic perspective, our results on the whole confirm (or fail to disprove) Meyer's hypothesis regarding the constitution of Old French legendiers. They also bring to light some new facts, such as potential subseries that were not previously identified, and raise questions about Saint Longin's Life, which, we believe, can be considered part of collection B instead of A, and the Life of Saint Lambert, whose possible attribution to Wauchier de Denain deserves further investigation. To further strengthen our conclusions and sustain the analysis, the whole process could be applied to the two other witnesses of the legendary C, but also to other legendaries, such as those of family G (three witnesses), which are later manuscripts with a more complex tradition and whose compilation is less conservative of the original blocks. 16 Finally, we hope that our approach can motivate new investigations, using computational humanities, of holistic philological and historical hypotheses formulated in the nineteenth century, which still sometimes form the basis of our understanding of the sources. By bringing together the work of the founders of our fields, such as Paul Meyer, and novel computational methods, we can hope to achieve progress in many areas, perhaps especially in those left out of the literary canon envisioned by many close reading studies.