Abstract

New Testament textual critics have for decades calculated the similarities between manuscripts in much the same manner, using collations and variation units. This conventional methodology requires enormous amounts of time and manual work. A new method is proposed here that does not require these preprocessing steps, enabling the establishment of quantitative relationships using manuscript transcriptions only. This is achieved by applying a technique called shingling, where the manuscript transcriptions are automatically divided into smaller pieces called tokens or k-grams. A string metric is then used to calculate the similarities between the tokenized strings. This method is efficient, meaning that it allows critics to consider all textual evidence in a manuscript tradition. At the same time, it returns similarity values that are compatible with those of conventional approaches.

1 Introduction

Until the invention of printing with movable type in the 15th century, all texts were reproduced by the process of manual copying. Letter by letter, scribes duplicated the texts, but the process was not perfect and changes occurred, ranging from simple scribal errors to intentional alterations of theological significance. Textual criticism studies these changes, identifying variations in each manuscript tradition and determining the relationships between the variants and the manuscripts. The goal of the discipline is to recover a text as close as possible to that of the author.1

In studying the relationships between manuscripts (henceforth MSS) of the New Testament (NT), a quantitative approach suggested by Ernest C. Colwell and Ernest W. Tune (1963) has proven to be useful. This is known as the ‘quantitative method of textual analysis’ (Metzger and Ehrman, 2005, p. 234). The analysis is conducted by collating the MSS (comparing them side by side), separating sequences of text where there are variations among the MSS, called variation places or units of variation, and recording how many times the MSS agree within these variation places.2 The instances of agreement are converted into percentages by dividing the number of agreements by the total number of variation places shared between pairs of MSS (Colwell and Tune, 1963, pp. 25–8). This measurement of similarity can be formulated as:

$$\mathrm{similarity}(A, B) = \frac{\text{number of agreements between } A \text{ and } B}{\text{total number of variation places shared by } A \text{ and } B} \times 100\%$$
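For illustration, this calculation can be sketched in a few lines of Python. The dict-based representation of each manuscript's readings is a hypothetical data model chosen for the example, not that of any particular tool:

```python
def agreement_percentage(ms_a, ms_b):
    """Colwell-Tune style agreement: the share of variation units in
    which two manuscripts attest the same reading.

    ms_a, ms_b: dicts mapping a variation-unit ID to the reading the
    manuscript supports there (units where a manuscript is lacunose
    are simply absent from its dict)."""
    # Only units attested by both manuscripts can be compared.
    shared = ms_a.keys() & ms_b.keys()
    if not shared:
        return 0.0
    agreements = sum(1 for u in shared if ms_a[u] == ms_b[u])
    return 100 * agreements / len(shared)

# Two toy manuscripts with readings in five variation units:
ms1 = {1: 'a', 2: 'b', 3: 'a', 4: 'c', 5: 'a'}
ms2 = {1: 'a', 2: 'a', 3: 'a', 4: 'c', 5: 'b'}
print(agreement_percentage(ms1, ms2))  # 60.0 (3 agreements in 5 units)
```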
Following the lead of Colwell and Tune, various critics have adopted or refined this approach in order to classify MSS (Fee, 1968; Hurtado, 1981; Ehrman, 1986; Geer, 1994; Osburn, 2004; Racine, 2004). Others have utilized computers while using the same similarity measure to conduct multivariate analysis, such as hierarchical clustering (Griffith, 1969; Thorpe, 2002; Finney, 2010; Donker, 2011; Finney, 2018).3 Other similar grouping approaches, such as non-negative matrix factorization, use approximate techniques to establish the quantitative relationships between MSS (McCollum, 2019).

The quantitative approach is also used in computer-assisted stemmatology, not to group the MSS as such but to reconstruct textual histories using treelike patterns.4 Stemmatology has become an important part of NT textual criticism and is widely used in upcoming critical editions (Wasserman and Gurry, 2017, pp. 5–13). This is due to the groundbreaking work of Gerd Mink, who developed an approach known as the Coherence-Based Genealogical Method (CBGM) (Mink, 2004, 2008). It uses the described quantitative analysis (called pregenealogical coherence in the CBGM) to connect variants and witnesses to one another.5 A high agreement level (e.g. above 90%) between a pair of witnesses results in good pregenealogical coherence, which is taken as an indication of their close relation (Mink, 2008, pp. 144–7).

However, all the methods described take a lot of time, which limits the number of MSS that can be considered. This is because they all demand extensive preprocessing work (see Section 3), a problem that needs to be addressed. After all, the NT comprises the world’s largest manuscript tradition in terms of surviving MSS. For example, the Acts of the Apostles, the fifth book of the NT, has survived in 530 Greek MSS (Greek being the original language of the NT).6 Though this is not the largest tradition in the NT (the Gospel according to Matthew has survived in more than 2,000 MSS), no survey has been able to consider the tradition in its entirety.7

A new method is proposed here which operates on manuscript transcriptions (texts typed into a file), allowing all MSS to be taken into account while returning similarity values that are compatible with those using conventional preprocessing steps. First, the texts are transcribed into a file and automatically tokenized (divided into smaller pieces) using a technique called shingling, which is commonly used, for instance, in approximate string matching (Ukkonen, 1992) and natural language processing (Goldberg, 2017, p. 75). Second, the quantitative relationships between MSS are established using the Sorensen–Dice coefficient (SDC), which measures the similarity between two sets of data (Dice, 1945; Sorensen, 1948). This coefficient is widely used to estimate, for instance, genetic similarities (Kosman and Leonard, 2005; Dalirsefat et al., 2009).

This approach is tested using the manuscript tradition of Acts. A new program called Relate was developed to conduct the analysis. The source code, with detailed instructions on how to use the program, is readily available at https://github.com/PasHyde/relate.

2 The Data

The dataset used is composed of fifty-four Greek MSS that contain Chapter Five of Acts. This is the same data that were used by Pasi Hyytiäinen in his phylogenetic analysis of Acts (Hyytiäinen, 2021, pp. 7–14), available at https://github.com/PasHyde/relate/tree/main/Data. There one finds the transcriptions of the MSS together with the encodings (variation places and variants denoted using, e.g. numbers). The transcriptions form the main data of the analysis at hand, and the encodings, prepared by Hyytiäinen, are used as a comparison dataset since they were generated via conventional preprocessing steps. The survey at hand aims to bypass most of these steps, which makes the encoded data an excellent point of comparison. The agreement rates of the CBGM are also considered due to the important role the method currently has in the field. These are available for Acts at https://ntg.uni-muenster.de/acts/ph4/.

3 Conventional Preprocessing Steps of the Quantitative Analysis of MSS

3.1 Transcribing and collating the texts

Conventional quantitative analysis of MSS demands large amounts of preprocessing work, which is described in the following. The texts first need to be transcribed, which often entails regularizing the spelling and transcribing the texts without punctuation. This can be seen as the first weighing of the variants, meaning that critics weigh the variants and decide which are significant for the analysis and which are neglected (on the second weighing, see Section 3.2).8 Critics usually discard spelling differences and punctuation, since these variants are of a linguistic nature and are unlikely to contain relevant information about relations between MSS (Mink, 2011, p. 143; Howe et al., 2012, p. 149).

Then the texts are collated. Previously this had to be done manually, but the process can now be conducted using the software CollateX (Dekker et al., 2015) available at https://collatex.net/.9 However, the use of CollateX is not as straightforward as one might expect. A critic may judge several shorter variation units identified by CollateX as comprising one unit or vice versa, redividing the variation units accordingly. A simple example demonstrates the issue:

Table 1 shows the collation table CollateX automatically returns for the example texts. Text1 and text3 have longer texts with only one change (red → brown), but text2 is missing the first half completely. CollateX automatically divides this example into four variation units. If one uses this collation table to calculate the similarities between the texts, the similarity between text1 and text3 is 75% (three units divided by four units). The similarity between text2 and the other two is 25% (1/4). However, it is perhaps not wise to record each missing word of text2 separately, as this might add weight to a single change (Howe et al., 2012, p. 61). Hence, a critic may wish to counterbalance this and combine the two units in the middle, which results in three variation units. The similarity values would then be 66.6% (2/3) and 33.3% (1/3) between the texts. In other words, a critic needs to inspect each variation unit after the automated collation is done by CollateX, deciding whether the variation units are justified or not.10

Table 1. An automated collation table generated by CollateX

text1 | the quick | red   | fox | jumped over the lazy dog
text2 |           |       |     | jumped over the lazy dog
text3 | the quick | brown | fox | jumped over the lazy dog
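For illustration, the collation in Table 1 can be reproduced with the Python collatex package (pip install collatex). This is a minimal sketch; the exact output layout and keyword arguments may differ between versions of the package:

```python
from collatex import Collation, collate

collation = Collation()
collation.add_plain_witness('text1', 'the quick red fox jumped over the lazy dog')
collation.add_plain_witness('text2', 'jumped over the lazy dog')
collation.add_plain_witness('text3', 'the quick brown fox jumped over the lazy dog')

# Returns an alignment table like Table 1; with segmentation enabled,
# runs of matching tokens are merged into single cells.
table = collate(collation, segmentation=True)
print(table)
```

Whether the resulting units are kept, merged, or redivided remains the critic's decision, as discussed above.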

While the collation table and the variation units are inspected, singular readings (variants found in individual MSS) are often eliminated from the quantitative analysis, which arises from the manner in which a variation unit is defined. Following Colwell and Tune, practically all subsequent quantitative analyses (except those of the CBGM) adopted the idea that a variation unit is a segment of a text containing at least two variants supported by at least two MSS (Colwell and Tune, 1964, p. 254; Epp, 1976, p. 157; Hurtado, 1981, p. 11; Osburn, 2004, p. 178).11 The logic behind this definition is, e.g. that singular readings tell nothing about the relation between MSS, since they leave a manuscript unrelated to the others (Colwell and Tune, 1963, p. 27). In the CBGM, singular readings are also considered, since the manuscript that preserves one may be the only survivor of a once-strong strand of transmission (Wachtel, 2000, pp. 34–5).12 The definition of a variation unit in the CBGM is then a segment of a text where at least two variants exist, supported by at least one manuscript (Mink, 2004, pp. 27–8).

On the other hand, CollateX is not well suited for collating a large number of MSS of some length. Table 2 summarizes the execution times of CollateX as the size of the data increases. The test was conducted using the fifty-four MSS of Acts 5, which were then copied several times. The chapter in question contains forty-two verses and, depending on the manuscript, approximately 780 words and 4,800 letters.

Table 2. The execution times of CollateX

Chapters | Time (min)
1        | 21
2        | 110
4        | 468
8        | 4,340
16       | – (out of memory)

As can be seen from the table, the speed of CollateX decreases rapidly as the length of the text increases. Up to four chapters, the execution time is somewhat manageable, but with eight chapters CollateX needed over 72 h to complete the collation. When the size of the data reaches sixteen chapters, the memory runs out.13 Thus, longer texts must be divided into chapters or even verses in order to ensure the speed (and accuracy) of CollateX (Dekker et al., 2015, p. 461). CollateX is an excellent tool for shorter texts and smaller numbers of MSS, but it is perhaps not the most efficient option for preprocessing large numbers of long MSS for quantitative analysis.

3.2 Selecting genetically significant variants and encoding

Often the next step is to perform a second weighing of variants, which at this point refers to selecting genetically significant variants. Though these variants can be understood differently, they usually refer to variants that did not arise accidentally (Colwell and Tune, 1963, p. 26; Fee, 1993, pp. 67–8; Geer, 1994, p. 26). In the CBGM, all variants are considered except most spelling differences; hence, this phase is not included in the analyses of the CBGM (Wasserman and Gurry, 2017, pp. 38–9). There are some differences concerning how and when the selection is done. Colwell and Tune insisted that genetically significant variants should be selected before counting the agreements, meaning that they made one calculation, considering the genetically significant variants only and excluding all variants which may have arisen accidentally (Colwell and Tune, 1963, p. 26). Starting from Fee, critics began to weigh variants and calculate similarities in two stages. In the first stage, critics use all variants in their calculations except most spelling differences, itacisms, nonsense readings, etc. In the second stage, all variants are excluded apart from the genetically significant ones. The second stage, then, functions as a way to clarify the manuscript relation that was noted in the first stage (Fee, 1968, p. 5; Geer, 1994, pp. 6–7; Hurtado, 1981, p. 11). The underlying idea of weighing variants in this manner is that one should base the relationships on significant changes, such as large additions or deletions (Fee, 1993, pp. 67–8). If one chooses to conduct the second weighing of the variants, this must also be done manually, since it is based on the judgments of a critic concerning the significance of the variants (Donker, 2011, p. 204).

In computer-assisted analyses, the variation places must often be encoded using binary (sequences of 0s and 1s) or multistate (0, 1, 2, 3, etc.) encoding schemes, enabling programs to process the data (Finney, 2018, p. 17). Here each variant (in each variation unit) is denoted with a number, and if two MSS include the same variant, they receive the same number, as in the sketch below.
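The following sketch illustrates a multistate encoding; the helper is hypothetical and written for this example only:

```python
def encode_multistate(units):
    """units: a list of variation units, each a list holding one
    reading per manuscript (the manuscript order is the same in every
    unit).  Returns one state sequence per manuscript; within a unit,
    identical readings receive identical state numbers."""
    codes = [[] for _ in units[0]]
    for unit in units:
        states = {}  # reading -> state number within this unit
        for ms, reading in enumerate(unit):
            codes[ms].append(states.setdefault(reading, len(states)))
    return codes

# One unit where three MSS read red/brown/red and one where they
# read fox/fox/dog:
units = [['red', 'brown', 'red'], ['fox', 'fox', 'dog']]
print(encode_multistate(units))  # [[0, 0], [1, 0], [0, 1]]
```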

3.3 The starting point of a new methodology

All the preprocessing steps described take time, which limits the number of MSS (and variation places) that can be included in the analysis. Nevertheless, it seems that one should hold on to the transcribing process even if all other preprocessing steps are automated. Not all the information included in the MSS is equally useful for establishing the quantitative relationships between them. Earlier Greek majuscule MSS (copied using upper-case letters) do not contain punctuation marks, whereas these are common practice in later medieval minuscule (lower-case letters) MSS. Including punctuation marks in the analysis may distance MSS from one another despite high textual agreement. Spelling, on the other hand, differs significantly even within the most carefully written MSS, thus revealing little information about the relationships between them (Howe et al., 2001, p. 123; Royse, 2008, p. 90). Hence, the starting point of the proposed new method and the software Relate is the capability to establish quantitative relationships between MSS using transcriptions only, without the need for collations, variation units, or encodings.

Though this new approach increases the level of automation in the quantitative analysis, it must be underlined that it does not produce absolute results, and nothing of the sort is suggested here. The program Relate operates on a specific set of rules using the data that are prepared by a user. The results that Relate returns depend on the settings chosen by the user, which affect the similarity values. However, the similarity values returned by Relate are compatible with those of conventional techniques while requiring much less time. Also, Relate is able to process much larger datasets than, for instance, CollateX.

4 Character- and Token-Based String Metrics

In computer science, letters in a text are referred to as characters and sequences of characters (texts) as strings. String metrics, in turn, comprise a type of metric that measures the similarity (or, conversely, the distance) between two strings. Some of these metrics (Levenshtein, Hamming, Jaro, etc.) are character based (i.e. they use direct character comparisons). Others, such as the one used here, are token based, meaning that the strings are first divided into substrings, called tokens, shingles, or k-grams, and then a measure, such as the Jaccard or Overlap coefficient, is used for statistical comparison. One of the benefits of the token-based approach is its time efficiency compared to the character-based approaches (Ukkonen, 1992, pp. 191–2).

Let us consider, for instance, the Levenshtein distance, which is defined as the smallest number of edit operations (insertion, deletion, and substitution) required to transform one string into another (Levenshtein, 1966). The Levenshtein distance tabulations, using a dynamic programming approach, can be seen in Table 3. Here we have two strings: ‘the fast brown fox’ and ‘the slow brown fox’. Starting from the top left, the algorithm moves one character to the right in the table and returns the smallest value from the three preceding cells (left, top left, and above) and does nothing else since there is no change (T, T). Moving on to the next character, T and H differ, and the algorithm takes the smallest value from the three preceding cells but now adds one (min(0,1,2) + 1 = 1). The character T is compared to every other character in the other string, then H, E, and so forth, one row at a time.

Table 3. The Levenshtein algorithm (dynamic programming approach). [The table is reproduced as an image in the original and is not rendered here.]

This way it iterates through the characters and at the same time tabulates the distance between the strings, which can be seen in the cells where the corresponding characters meet (bolded) in the table. The Levenshtein distance between the strings is seen in the last cell, bottom right (4). If we wish to convert the Levenshtein distance into a similarity value, we use the following formula:

$$\mathrm{similarity}(s_1, s_2) = 1 - \frac{\mathrm{lev}(s_1, s_2)}{\max(|s_1|, |s_2|)} = 1 - \frac{4}{18} \approx 0.78 = 78\%$$
The Levenshtein similarity between the strings seems rather good: since we have four words, one of which differs (3/4 = 0.75), 78% is only a slight overestimation. The problem with the Levenshtein calculations in Table 3 lies in their cost, requiring O(n×m) time and space. This is because every character is compared to every other character, which in our example means comparing strings of eighteen characters, requiring 18 × 18 = 324 comparisons, and for each character the allowed edit operations are considered while the intermediate results are stored. Even though several optimization techniques exist (Jokinen et al., 1996, p. 1439), the use of Levenshtein becomes difficult when the data size increases. A sketch of the procedure follows.
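The dynamic-programming procedure described above is easy to state in Python; this plain, unoptimized version makes the O(n×m) cost visible:

```python
def levenshtein(s1, s2):
    """Classic dynamic-programming Levenshtein distance."""
    n, m = len(s1), len(s2)
    # dist[i][j] = edits needed to turn s1[:i] into s2[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # i deletions
    for j in range(m + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[n][m]

d = levenshtein('the fast brown fox', 'the slow brown fox')
print(d)           # 4
print(1 - d / 18)  # ~0.78, the similarity discussed above
```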

Table 4 shows the execution times of the Myers algorithm (Myers, 1999), which is one of the fastest Levenshtein algorithms.14

Table 4. Average execution times per comparison of the Myers algorithm (Levenshtein)

Number of chapters | 1      | 2      | 4      | 8      | 16     | 32
Execution time (s) | 0.0114 | 0.0244 | 0.1067 | 0.2383 | 0.8537 | 3.2591

Here two MSS of Acts 5 were selected and then copied multiple times while calculating the Levenshtein similarity. As we can see, this algorithm can be used rather efficiently on shorter texts, but with longer texts the execution time becomes infeasible. If we compare all fifty-four MSS of Acts in their entirety (twenty-eight chapters) using this algorithm, one comparison would take approximately 2 s. Therefore, the overall duration, if constructing a symmetric similarity matrix (54 × 54 = 2,916 comparisons) as in Table 11, would be approximately 6,000 s or 100 min. While this is an enormous improvement over the speed of CollateX, the aim here is to find a methodology that can analyze much larger datasets than fifty-four MSS; hence, the execution time must be significantly lower than 2 s per comparison. However, the more difficult problem with Levenshtein (or any other character-based approach) is that the slight overestimation detected in the small example in Table 3 increases as the strings become longer. By taking a random selection of four MSS from the comparison datasets, the one prepared by Hyytiäinen and that of the CBGM, one can evaluate the values calculated by Levenshtein (Table 5):

Table 5. Comparing the similarity values of Acts 5

MSS    | Sample dataset | CBGM  | Levenshtein
01, 03 | 90.00          | 93.37 | 98.47
03, 05 | 64.98          | 73.93 | 83.24
05, 08 | 55.65          | 67.91 | 83.28

The similarity values of Levenshtein are significantly higher, which may be due to the manner in which Levenshtein, or any other character-based approach, iterates through characters one by one. The editors of CBGM, for instance, differentiate verb forms from one another. In the Greek manuscript tradition of the NT, the verb ἀκούω (‘to hear’) is very common. In Acts 5:5, we find the verb in two different participle forms: ἀκούων (akūōn) and ἀκούσας (akūsās). These are taken to represent different readings by the editors of CBGM, that is, a disagreement. Levenshtein, on the other hand, here calculates 57% agreement (1 – 3/7) because of the identical stems of the verbs (ἀκού-). In addition, Greek contains myriads of different words which begin with the same preposition (παρα, περι, ανα, αντι, etc.): παραβαῖνω (‘to disobey’), παραβάλλω (‘to throw’). This means that character-based calculations detect similarities between words that are altogether different. These issues suggest that Levenshtein, or character-based calculations in general, may not be the best option for the analysis at hand.

5 The Method

5.1 Shingling and bigrams

Strings can be tokenized, for instance, into ordered or unordered sets of words. An example of the latter is the bag-of-words (BOW) approach, where each document is treated as an unordered set of words. This approach works well in the small example in Table 3, since one of the four words is different (hence 3/4 = 0.75 = 75% similarity). However, it would be unable to detect changes in word order, since BOW considers individual words only. Shingling, on the other hand, is a more informative approach, since it better preserves the original structure of the text (Goldberg, 2017, pp. 69–75). A k-shingle or k-gram is a consecutive word sequence (i.e. a substring of length k). If k = 2, every substring has a length of two words (2-gram or bigram), where each word depends only on the previous one (k – 1 words). These bigrams then serve as an approximation of the underlying text, an example of a Markov assumption, which holds that the probability of a word depends only on the previous one (Goldberg, 2017, p. 106).

Besides these word-grams, shingling can be conducted using character- or letter-grams, in which case a bigram (k = 2) refers to a substring of two characters. White spaces also qualify as characters, which becomes clear in the following examples. Here, the size of k becomes crucial:

  • k must be large enough, or most strings will share most k-grams

If one chooses a k-size that is too small, most sequences of k characters will appear in most strings. If so, we could have strings whose shingle sets have high similarity even though the documents share none of the same sentences or even phrases. Word-bigrams can be used in the dataset of Chapter 5 of Acts, but when using letter-grams, one should choose a k-size between 5 and 10 (Leskovec et al., 2014, pp. 78–9).

Let us consider string1 in Table 3: ‘the fast brown fox’. As noted above, the analysis at hand is conducted using the program Relate, available at https://github.com/PasHyde/relate, which offers a choice between letter- and word-grams:

Word-bigram = ‘the fast’, ‘fast brown’, ‘brown fox’

Letter-bigram = ‘th’, ‘he’, ‘e ’, ‘ f’, ‘fa’, ‘as’, ‘st’, ‘t ’, ‘ b’, ‘br’, ‘ro’, ‘ow’, ‘wn’, ‘n ’, ‘ f’, ‘fo’, ‘ox’

As can be seen, each subsequent bigram partially overlaps with the previous one (hence the name shingling). These bigrams represent the underlying string where the order of the bigrams does not matter. The string may also be presented as:

Word-bigram = ‘fast brown’, ‘the fast’, ‘brown fox’

Letter-bigram = ‘fo’, ‘ f’, ‘e ’, ‘ow’, ‘th’, ‘wn’, ‘n ’, ‘ox’, ‘ro’, ‘ b’, ‘as’, ‘fa’, ‘st’, ‘br’, ‘t ’, ‘he’, ‘ f’

What matters is how many bigrams the pairs of analyzed strings share. The similarity between the tokenized strings can be calculated in different ways by treating each bigram as an element of a set. In other words, each string is converted to a set, which is defined as a collection of distinct elements (Cantor, 1915, p. 85), and each bigram to an element of that set. Since a set is a collection of distinct elements, all duplicate bigrams are ignored within a set.15

If one wishes to establish the similarity between the two strings in Table 3 using the shingling approach:

string1 = ‘the fast brown fox’

string2 = ‘the slow brown fox’

The strings are first converted into two sets, in this example using letter-bigrams. The program deletes one duplicate from each set: a ‘ f’ bigram (from the words ‘fast’ and ‘fox’) from set1 and one ‘ow’ bigram (from the words ‘slow’ and ‘brown’) from set2, resulting in sets of sixteen bigrams:

set1 = {‘fo’, ‘ f’, ‘e ’, ‘ow’, ‘th’, ‘wn’, ‘n ’, ‘ox’, ‘ro’, ‘ b’, ‘as’, ‘fa’, ‘st’, ‘br’, ‘t ’, ‘he’}

set2 = {‘w ’, ‘ox’, ‘th’, ‘he’, ‘e ’, ‘n ’, ‘wn’, ‘ro’, ‘ow’, ‘br’, ‘lo’, ‘ b’, ‘fo’, ‘ s’, ‘ f’, ‘sl’}
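The shingling step itself can be sketched in a few lines of Python; these helpers are illustrative and not necessarily how Relate implements tokenization:

```python
def letter_kgrams(text, k=2):
    """All consecutive substrings of length k, as a set (duplicates
    are ignored, as in the example above).  White space counts."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def word_kgrams(text, k=2):
    """All consecutive k-word shingles, as a set."""
    words = text.split()
    return {' '.join(words[i:i + k]) for i in range(len(words) - k + 1)}

set1 = letter_kgrams('the fast brown fox')
set2 = letter_kgrams('the slow brown fox')
print(len(set1), len(set2))  # 16 16 (one duplicate dropped from each)
print(sorted(word_kgrams('the fast brown fox')))
# ['brown fox', 'fast brown', 'the fast']
```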

Then a measure is applied to tabulate the similarity between the sets. The most often used method for measuring the similarity between sets is the Jaccard coefficient, which is calculated by dividing the number of elements shared between the sets by the number of distinct elements in the two sets together (Jaccard, 1901). This gives a value between 0 and 1: the more two sets have in common, the closer the value is to 1, and vice versa. Jaccard is often referred to as the ratio of the intersection over the union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The program iterates through the sets, tabulating whether an element (letter–bigram) occurs in a set or not as demonstrated in Table 6:

Table 6

Tabulating the intersection over union

graphic
graphic
Table 6

Tabulating the intersection over union

graphic
graphic

This is computationally more efficient and faster than any of the Levenshtein algorithms, since characters are not separately compared to every other character while the edit operations (deletion, insertion, and substitution) are considered. The program needs to iterate through 2 × 16 bigrams, O(n×m), but since n is small and constant, the execution time is in fact linear, O(m) (see also Ukkonen, 1992, p. 192). This leads to the following execution times, using the same technique as in Table 4. In Table 7 below, letter-grams of k-size 10 and word-bigrams are used.

Table 7. Average execution times per comparison of the shingling algorithm

Number of chapters               | 1      | 2      | 4      | 8      | 16     | 32
Execution time (s), letter-grams | 0.0235 | 0.0329 | 0.0450 | 0.1099 | 0.1207 | 0.2270
Execution time (s), word-bigrams | 0.0246 | 0.0280 | 0.0320 | 0.0572 | 0.0810 | 0.1282

Compared to the Myers algorithm, shingling efficiently decreases the execution time, so that the fifty-four MSS of Acts can be analyzed, using the letter-grams, in their entirety (twenty-eight chapters) in about 10 min (2,916 comparisons × 0.21 s = 612 s = 10.2 min). Using the word-bigrams, one can complete the same task in about 5 min (2,916 × 0.11 s = 321 s = 5.3 min). This means that the algorithm can process all 530 MSS of Acts, once transcribed, in a little over 9 h (rounding up to 550 MSS: 550 × 550 = 302,500 comparisons × 0.11 s = 33,275 s = 554.5 min = 9.2 h). The average execution times are summarized in Fig. 1:

Fig. 1. Comparison of the execution times

The differences between the execution times of the letter- and word-grams arise from the number of shingles that need to be tabulated. Chapter 5 of Acts contains, depending on the manuscript, approximately 4,800 characters. This returns 4,791 letter-grams of k-size 10 but only 779 word-bigrams (from 780 words).

5.2 Selecting the similarity coefficient

Shingling efficiently processes large numbers of MSS, but what about the accuracy of the method? As can be seen in Table 6, the intersection between the example sets is 12 (bigrams shared between the sets) and the union 20 (the number of distinct bigrams in both sets together); hence, the Jaccard coefficient is 12/20 = 0.6 = 60%. However, one expects the similarity to be near 75%; thus, Jaccard seems to underestimate the similarity. In studies conducted in ecology, which use different coefficients to estimate genetic similarities, the Jaccard coefficient is always smaller than other coefficients (Kosman and Leonard, 2005, p. 417; Dalirsefat et al., 2009, p. 7). The reason for this is the use of the union, which merges the distinctive elements of the two sets into one, treating the two sets as one unit (Sorensen, 1948, p. 6). This decreases similarity values, since the size of the denominator increases in the merging.

A simple solution would be to take the number of shingles in one or the other of the sets and use it as the denominator. This seems to work well in the example, since the intersection (12) can be divided by the number of elements (16), which is the same in both sets: 12/16 = 0.75 = 75%. This, however, raises other problems. In Chapter 5 of Acts, the data taken from the Greek MSS differ in length by 5–10%. The lengths vary from one manuscript to the next due to accidental omissions of words, etc. If one chooses the size of the larger set and uses it as the denominator, the similarity values automatically decrease to a certain extent. On the other hand, choosing the smaller of the two sets (the Overlap coefficient) automatically excludes from the calculations all deletions present in the original strings. This would be an enormous mistake, since deletions are one of the most important sources of variation in different manuscript traditions (Royse, 2008, pp. 100, 719; Trovato, 2014, p. 55).

There is another similarity coefficient, however, which is known to be consistently greater than Jaccard: the Sorensen-Dice coefficient (henceforth SDC). This similarity measure was independently developed by Thorvald Sorensen (1948) and Lee Dice (1945). SDC is closely related to Jaccard, with some crucial modifications:

$$\mathrm{SDC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$
SDC equals twice the size of the intersection divided by the sum of the numbers of distinct elements in each set. When this is applied to our example:

$$\mathrm{SDC}(\mathrm{set1}, \mathrm{set2}) = \frac{2 \times 12}{16 + 16} = \frac{24}{32} = 0.75 = 75\%$$
SDC returns exactly the value one was expecting. SDC is usually greater than Jaccard, since it attaches more importance to shared elements (Kosman and Leonard, 2005, p. 418). The proportion of the intersection increases because it is multiplied by two, while the denominator (the sum of the two set sizes) is at most twice the size of the union. Also, taking the sizes of both sets as the denominator decreases the effects of random length differences between MSS (Sorensen, 1948, p. 7). This method seems to work better than the other coefficients in the context of manuscript data.16
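Both coefficients are one-liners over the shingle sets. Reusing the letter_kgrams sketch from Section 5.1, the values discussed above can be verified:

```python
def letter_kgrams(text, k=2):
    # as sketched in Section 5.1
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Intersection over union."""
    return len(a & b) / len(a | b)

def sdc(a, b):
    """Sorensen-Dice: twice the intersection over the summed set sizes."""
    return 2 * len(a & b) / (len(a) + len(b))

set1 = letter_kgrams('the fast brown fox')
set2 = letter_kgrams('the slow brown fox')
print(jaccard(set1, set2))  # 0.6  (12/20, as in Table 6)
print(sdc(set1, set2))      # 0.75 (24/32, the expected 75%)
```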

5.3 Comparing the properties of letter- and word-grams

The final issue to be tackled here is to compare the manner in which word- and letter-grams behave or handle the different types of variation found in the MSS. These include word changes, deletions, additions, and changes in word order.

The two k-gram types behave somewhat differently when encountering different types of variation. In Table 8, the word-bigrams are considered first. The variations between the texts are marked using underscores. The number of differing word-bigrams is the same (2) for each type of variation. The differences in the SDC similarities arise from the total number of bigrams, which changes along with the varying string lengths. If the similarities are manually calculated, all these variation types are simply recorded as a disagreement (i.e. they count as one change). Here, a single word change is counted as two, since it affects two word-bigrams. Beyond that, all single-word variants are treated the same: variations in nouns, verbs, verb tenses, or voices are all counted as two, regardless of the length of the word. Multiple-word variants, on the other hand, affect more word-bigrams, increasing the number of differing bigrams. The SDC similarity values in the table seem rather low.

Table 8. Detecting variations using word-bigrams and SDC

Variation type    | Strings compared                                  | Differing word-bigrams                              | SDC similarity
Word change       | ‘the fast brown fox’ vs ‘the slow brown fox’      | (the fast, the slow), (fast brown, slow brown) = 2  | 2×1/(3+3) = 2/6 = 0.33 = 33%
Deletion          | ‘the fast ___ fox’ vs ‘the fast brown fox’        | (fast fox, fast brown), (–, brown fox) = 2          | 2×1/(2+3) = 2/5 = 0.40 = 40%
Addition          | ‘the fast brown fox’ vs ‘the fast red brown fox’  | (fast brown, fast red), (–, red brown) = 2          | 2×2/(3+4) = 4/7 = 0.57 = 57%
Word-order change | ‘the fast brown fox’ vs ‘the fast fox brown’      | (fast brown, fast fox), (brown fox, fox brown) = 2  | 2×1/(3+3) = 2/6 = 0.33 = 33%
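The word-bigram values in Table 8 can be reproduced with the word_kgrams helper sketched in Section 5.1:

```python
def word_kgrams(text, k=2):
    # as sketched in Section 5.1
    words = text.split()
    return {' '.join(words[i:i + k]) for i in range(len(words) - k + 1)}

def sdc(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

pairs = [('the fast brown fox', 'the slow brown fox'),      # word change
         ('the fast fox', 'the fast brown fox'),            # deletion
         ('the fast brown fox', 'the fast red brown fox'),  # addition
         ('the fast brown fox', 'the fast fox brown')]      # word-order change
for a, b in pairs:
    print(round(sdc(word_kgrams(a), word_kgrams(b)), 2))
# 0.33, 0.4, 0.57, 0.33 -- the values of Table 8
```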

The low values in Table 8 are due to the size of the bigrams, which is too large for these short examples. This points to another rule for choosing a proper k-size, a negative counterpart of the first one (see Section 5.1):

  • k must be small enough, or most strings will have few or no shared k-grams.

Choosing a k that is too large leads to sparsity issues; that is, one would expect very few word-grams to appear in very few strings (Goldberg, 2017, p. 75). If so, one could have strings whose sets of word-grams have low similarity even though they contain several identical substrings. Letter-bigrams suit these short examples much better (i.e. their size is small enough). In Table 9, the similarity values of letter-bigrams and SDC are compared using the examples from the previous table.

Table 9. Detecting variations using letter-bigrams and SDC

Word change   | Deletion              | Addition         | Word-order change
(fast → slow) | (delete word ‘brown’) | (add word ‘red’) | (brown ↔ fox)
75%           | 76.92%                | 88.88%           | 93.75%

Letter-bigrams and SDC detect all types of variation while giving a high similarity value to word-order changes, since these change only two shingles in total: the bigram ‘n ’ in ‘the fast brown fox’ is replaced by ‘x ’ in ‘the fast fox brown’. This reveals the importance of including the white spaces in the shingling process (Leskovec et al., 2014, p. 78). In the deletion of the word ‘brown’, the size of the set decreases, which in turn decreases the similarity, whereas a word addition increases the number of agreeing letter-bigrams and thus the similarity values.17 A short word addition results in higher agreement values than a longer one: if one adds the word ‘medium’ instead of ‘red’, the similarity is 82.05%, since the longer word increases the number of nonmatching letter-bigrams.

How should this behavior of the method be understood? Word-bigrams give more weight to multiword changes, while letter-grams attach more importance to longer words, decreasing the agreements. This leads to the profound question of whether all variants should count in the same way: should more weight be given to a change that was more difficult for a scribe to make, such as a longer variation? It is not necessary to offer an answer here, but rather to describe how the methodology handles different types of variation. It does not give absolute results but embodies the dependencies just described. When using the letter-grams, one expects higher similarity values compared to calculations based on variation units, since letter-grams give less weight to word-order changes, which are exceedingly common in the MSS of the NT (Metzger and Ehrman, 2005, p. 257).

In the following, both word- and letter-grams are used to compare the results with the comparison datasets: sample dataset (based on the encodings in Hyytiäinen, 2021) and CBGM. As stated earlier, word-bigrams can be used in the MSS of Acts 5, but the size of letter–bigrams is exceedingly small (k must be large enough, or most strings will share most k-grams); hence, one must define a proper length of k. This is tested in Table 10 by applying SDC to the comparison datasets of Table 5 with differing lengths of k:

Table 10. Comparing similarity values

MSS    | Sample data | CBGM  | SDC, k = 6 | SDC, k = 8 | SDC, k = 10 | SDC, k = 12
01, 03 | 90.00       | 93.37 | 95.85      | 94.65      | 93.42       | 92.33
03, 05 | 64.98       | 73.93 | 83.23      | 77.95      | 73.04       | 69.10
05, 08 | 55.65       | 67.91 | 79.49      | 72.84      | 67.17       | 62.62

As expected, the values are mostly higher than those of the comparison sets, but they decrease as the size of k is increased. Increasing the size of k also increases the number of differing k-grams (variations affect more k-grams), which decreases the agreement rates. The k-sizes of 6 and 8 seem to return overly high values, but the k-length of 10 comes exceedingly close to the calculations of the CBGM. It must be mentioned here that the CBGM has higher similarity values than any of the previously conducted calculations (compare Osburn, 2004, p. 188 and Donker, 2011, p. 317). This is because more MSS are included in the analysis with more variation units (including singular readings), increasing the agreement rates (Mink, 2011, p. 147; Wasserman and Gurry, 2017, p. 41). Given that letter-grams are expected to return high values due to the issues explained earlier, the k-size of 12 is perhaps too large (Leskovec et al., 2014, p. 79). Hence, the k-size of 10 seems to be the best option for the analysis at hand.

6 Establishing Quantitative Relationships between MSS of Acts 5

A random selection of ten MSS was made: P74, 03, 05, 614, 876, 1175, 1409, 1739, 1884, and 2200. The transcriptions of these MSS were used as input for Relate, which returned the two symmetric similarity matrices in Table 11. This table presents the quantitative relationships between the MSS of Acts 5 using four different datasets: a sample dataset (based on the encodings in Hyytiäinen, 2021), CBGM, and SDC using letter-grams of k-size 10 and word-bigrams.

Table 11. Comparison of the similarity values in Acts 5

Sample dataset (a):

        P74     03      05      614     876     1175    1409    1739    1884    2200
P74     100.0   91.85   65.16   80.29   80.37   94.44   86.89   87.77   67.53   86.66
03      91.85   100.0   64.98   79.71   82.50   92.14   86.28   87.85   71.22   86.42
05      65.16   64.98   100.0   63.00   65.34   66.78   64.96   67.14   52.72   66.78
614     80.29   79.71   63.00   100.0   86.23   82.97   83.88   85.86   74.08   86.59
876     80.37   82.50   65.34   86.23   100.0   83.21   83.39   86.42   71.22   85.35
1175    94.44   92.14   66.78   82.97   83.21   100.0   89.53   91.42   71.94   90.71
1409    86.89   86.28   64.96   83.88   83.39   89.53   100.0   89.16   72.36   90.25
1739    87.77   87.85   67.14   85.86   86.42   91.42   89.16   100.0   72.30   97.14
1884    67.53   71.22   52.72   74.08   71.22   71.94   72.36   72.30   100.0   73.02
2200    86.66   86.42   66.78   86.59   85.35   90.71   90.25   97.14   73.02   100.0

Relate (SDC, letter-gram, k = 10):

        P74     03      05      614     876     1175    1409    1739    1884    2200
P74     100.0   92.16   71.00   84.23   84.06   94.12   87.52   89.74   74.13   89.00
03      92.16   100.0   73.04   86.44   87.86   94.80   89.84   92.80   78.78   91.89
05      71.00   73.04   100.0   72.98   74.74   73.85   72.21   74.19   66.39   74.21
614     84.23   86.44   72.98   100.0   90.44   87.73   86.95   90.04   80.96   90.30
876     84.06   87.86   74.74   90.44   100.0   87.74   86.65   90.45   78.63   89.98
1175    94.12   94.80   73.85   87.73   87.74   100.0   91.50   94.32   78.61   93.86
1409    87.52   89.84   72.21   86.95   86.65   91.50   100.0   91.52   78.02   91.85
1739    89.74   92.80   74.19   90.04   90.45   94.32   91.52   100.0   79.67   98.44
1884    74.13   78.78   66.39   80.96   78.63   78.61   78.02   79.67   100.0   79.82
2200    89.00   91.89   74.21   90.30   89.98   93.86   91.85   98.44   79.82   100.0

CBGM:

        P74     03      05      614     876     1175    1409    1739    1884    2200
P74     100.0   94.88   73.17   84.38   84.30   95.89   90.07   91.13   79.51   90.10
03      94.88   100.0   73.93   85.23   86.45   94.26   90.63   91.27   81.65   90.06
05      73.17   73.93   100.0   72.10   73.31   75.38   73.23   75.46   66.98   75.46
614     84.38   85.23   72.10   100.0   89.54   86.73   85.80   88.62   84.38   89.23
876     84.30   86.45   73.31   89.54   100.0   86.71   86.10   89.46   81.04   88.55
1175    95.89   94.26   75.38   86.73   86.71   100.0   92.73   93.35   82.21   92.75
1409    90.07   90.63   73.23   85.80   86.10   92.73   100.0   91.24   81.29   90.94
1739    91.13   91.27   75.46   88.62   89.46   93.35   91.24   100.0   82.57   97.59
1884    79.51   81.65   66.98   84.38   81.04   82.21   81.29   82.57   100.0   83.18
2200    90.10   90.06   75.46   89.23   88.55   92.75   90.94   97.59   83.18   100.0

Relate (SDC, word-bigram):

        P74     03      05      614     876     1175    1409    1739    1884    2200
P74     100.0   91.91   72.58   85.01   84.75   94.25   87.69   90.07   74.77   89.29
03      91.91   100.0   74.12   86.70   87.99   94.26   89.65   92.28   79.17   91.36
05      72.58   74.12   100.0   74.04   75.49   75.39   73.56   75.90   67.02   75.85
614     85.01   86.70   74.04   100.0   90.84   88.57   87.70   90.60   81.36   90.67
876     84.75   87.99   75.49   90.84   100.0   88.29   87.42   90.87   78.96   90.24
1175    94.25   94.26   75.39   88.57   88.29   100.0   91.53   94.43   79.06   93.94
1409    87.69   89.65   73.56   87.70   87.42   91.53   100.0   91.71   78.48   91.93
1739    90.07   92.28   75.90   90.60   90.87   94.43   91.71   100.0   79.83   98.36
1884    74.77   79.17   67.02   81.36   78.96   79.06   78.48   79.83   100.0   79.78
2200    89.29   91.36   75.85   90.67   90.24   93.94   91.93   98.36   79.78   100.0

a. The encodings of the sample data were processed by the Relate program using Hamming distance, which calculates how many characters match between pairs of MSS, divided by the length of the character sequences (279).

When the two datasets established using variation units are compared, CBGM has higher similarities, with manuscript 1884 being a somewhat extreme example of this. The difference between the sample dataset and CBGM averages 10% (with 1884 included).

SDC using the letter- or word-grams sometimes has higher values than CBGM, sometimes vice versa. In fact, the letter-grams and word-bigrams return coherent results despite using different types of k-grams. Even though the values of CBGM and SDC differ slightly, they both follow a similar pattern. MSS 05 and 1884 are known for their eccentric texts, leading to lower similarities compared to other MSS. On the other hand, P74, 03, and 1175 are known to represent a similar form of text, which is generally thought to represent a very early text (Osburn, 2004, p. 187). Their close relations are reflected in the values found in all matrices. This pattern can be seen in the phylogenetic trees below (Fig. 2), a model commonly used in phylogenetics and computer-assisted stemmatology to represent the relationships between the objects under study. Here the SplitsTree program is used (Huson and Bryant, 2006) with the UPGMA (unweighted pair-group method using arithmetic averages) algorithm, a simple clustering method (Sokal and Michener, 1958). The point here is not to make suggestions about the history of these MSS, but rather to depict their grouping patterns. The lengths of the branches vary, meaning that the similarity values deviate, but the pattern remains the same (i.e. the same MSS appear together in both trees). This means that the methodology proposed here was able to connect the same MSS to one another, like CBGM, but in a computerized manner.
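The article uses SplitsTree for the trees; for readers who prefer to stay in Python, a hedged sketch of the same UPGMA grouping can be produced with scipy's 'average' linkage. The five-manuscript similarity matrix below is excerpted from the letter-gram matrix in Table 11:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

labels = ['P74', '03', '05', '614', '876']
sim = np.array([[100.00, 92.16, 71.00, 84.23, 84.06],
                [ 92.16, 100.00, 73.04, 86.44, 87.86],
                [ 71.00, 73.04, 100.00, 72.98, 74.74],
                [ 84.23, 86.44, 72.98, 100.00, 90.44],
                [ 84.06, 87.86, 74.74, 90.44, 100.00]])

dist = 100.0 - sim            # turn similarities into distances
np.fill_diagonal(dist, 0.0)   # squareform expects a zero diagonal
tree = linkage(squareform(dist), method='average')  # UPGMA
dendrogram(tree, labels=labels)  # plotting requires matplotlib
```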

Fig. 2. Phylogenetic Trees of Acts 5

It is not surprising to find the same patterns in both trees due to the limited size of the datasets. Taking more MSS into the calculations would most likely result in different branching patterns.

7 Conclusions

This article aimed to propose a new method for establishing quantitative relationships between MSS, allowing one to take all manuscript evidence into account without laborious and time-consuming preprocessing steps. It demonstrates that this can be accomplished using shingling and the SDC. This methodology does not treat every type of variation the same, but attaches more importance to longer differences between MSS and gives lesser weight to other variants, such as word-order changes. The approach does not yield absolute or inherently more accurate results than conventional methods, since all methods have their own dependencies and are therefore relative. However, the similarity values that Relate returns are comparable with those of the CBGM, for instance, and deserve to be studied more thoroughly. The next step is to apply this methodology to artificially created manuscript traditions where the exact transmission history and correct stemma are known (as in Roos and Heikkilä, 2009, pp. 422–27), which is an excellent way to test the real accuracy of the approach. The present investigation demonstrates that it is indeed possible to establish quantitative relationships between MSS without collations, variation units, or encodings while producing results that are close enough to those of conventional approaches to warrant further investigation.

Funding

This article was supported by the Olvi Foundation (grant number 20210940).

References

Cantor, G. (1915). Contributions to the Founding of the Theory of Transfinite Numbers. New York: Dover Publications.

Colwell, E. C. and Tune, E. W. (1963). The quantitative relationships between MS text-types. In Birdsall, J. N. and Thomson, R. W. (eds), Biblical and Patristic Studies in Memory of Robert Pierce Casey. Freiburg i. B.: Herder, pp. 25–32.

Colwell, E. C. and Tune, E. W. (1964). Variant readings: classification and use. Journal of Biblical Literature, 83: 253–62.

Dalirsefat, S., Meyer, A. and Mirhoseini, S. (2009). Comparison of similarity coefficients used for cluster analysis with amplified fragment length polymorphism markers in the silkworm, Bombyx mori. Journal of Insect Science, 9(71): 1–8.

Dekker, R. H., van Hulle, D., Middell, G., Neyt, V. and van Zundert, J. (2015). Computer-supported collation of modern manuscripts: CollateX and the Beckett Digital Manuscript Project. Digital Scholarship in the Humanities, 30(3): 452–70.

Dice, L. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3): 297–302.

Donker, G. J. (2011). The Text of the Apostolos in Athanasius of Alexandria. Atlanta: Society of Biblical Literature.

Ehrman, B. D. (1986). Didymus the Blind and the Text of the Gospels. Atlanta: Scholars Press.

Epp, E. J. (1976). Toward the clarification of the term ‘textual variant’. In Elliot, J. K. (ed.), Studies in New Testament Text and Language: Essays in Honor of George D. Kilpatrick. Leiden: Brill, pp. 153–73.

Fee, G. D. (1968). Papyrus Bodmer II (P66): Its Textual Relationships and Scribal Characteristics. Salt Lake City: University of Utah Press.

Fee, G. D. (1993). On the types, classification, and presentation of textual variation. In Epp, E. J. and Fee, G. D. (eds), Studies in the Theory and Method of New Testament Textual Criticism. Grand Rapids: Eerdmans, pp. 62–80.

Finney, T. (2010). Mapping textual space. TC: A Journal of Biblical Textual Criticism, 15. http://jbtc.org/v15/Mapping/index.html.

Finney, T. (2018). How to discover textual groups. Digital Studies/Le Champ Numérique, 8(1): 7.

Geer, T. C. (1994). Family 1739 in Acts. Atlanta: Scholars Press.

Geer, T. C. and Racine, J.-F. (2013). Analyzing and categorizing New Testament Greek manuscripts. In Ehrman, B. D. and Holmes, M. W. (eds), The Text of the New Testament in Contemporary Research: Essays on the Status Quaestionis. Leiden: Brill, pp. 497–518.

Goldberg, Y. (2017). Neural Network Methods in Natural Language Processing. San Rafael, CA: Morgan & Claypool.

Griffith, J. G. (1969). Numerical taxonomy and some primary manuscripts of the Gospels. The Journal of Theological Studies, 20(2): 389–406.

Howe, C. J., Barbrook, A. C., Spencer, M., Robinson, P., Bordalejo, B. and Mooney, L. R. (2001). Manuscript evolution. Trends in Genetics, 17(3): 147–52.

Howe, C. J., Connolly, R. and Windram, H. F. (2012). Responding to criticisms of phylogenetic methods in stemmatology. Studies in English Literature 1500–1900, 52: 51–67.

Hurtado, L. W. (1981). Text-Critical Methodology and the Pre-Caesarean Text: Codex W in the Gospel of Mark. Grand Rapids: Eerdmans.

Huson, D. H. and Bryant, D. (2006). Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution, 23: 254–67.

Hyytiäinen, P. (2021). The changing text of Acts: a phylogenetic approach. TC: A Journal of Biblical Textual Criticism, 26: 1–28. http://jbtc.org/v26/TC-2021-Hyyti%C3%A4inen.pdf.

Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société vaudoise des sciences naturelles, 37: 547–79.

Jokinen, P., Tarhio, J. and Ukkonen, E. (1996). A comparison of approximate string matching algorithms. Software: Practice and Experience, 26: 1439–58.

Kosman, E. and Leonard, K. J. (2005). Similarity coefficients for molecular markers in studies of genetic relationships between individuals for haploid, diploid, and polyploid species. Molecular Ecology, 14(2): 415–24.

Lemey, P., Salemi, M. and Vandamme, A.-M. (eds) (2009). The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing. Cambridge: Cambridge University Press.

Leskovec, J., Rajaraman, A. and Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge: Cambridge University Press.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8): 707–10.

Lin, Y.-J. (2016). The Erotic Life of Manuscripts: New Testament Textual Criticism and the Biological Sciences. Oxford: Oxford University Press.

McCollum, J. (2019). Biclustering readings and manuscripts via non-negative matrix factorization, with application to the text of Jude. Andrews University Seminary Studies, 57(1): 61–89.

Metzger, B. M. and Ehrman, B. D. (2005). The Text of the New Testament: Its Transmission, Corruption and Restoration. Oxford: Oxford University Press.

Mink, G. (2004). Problems of a highly contaminated tradition: the New Testament: stemmata of variants as a source of a genealogy for witnesses. In van Reenen, P., den Hollander, A. and van Mulken, M. (eds), Studies in Stemmatology II. Amsterdam: John Benjamins, pp. 13–85.

Mink, G. (2008). CBGM presentation. Presented at the Münster Colloquium on the Textual History of the Greek New Testament, Münster, Germany, 3–6 August.

Mink, G. (2011). Contamination, coherence and coincidence in textual transmission: the Coherence-Based Genealogical Method (CBGM) as a complement and corrective to existing approaches. In Wachtel, K. and Holmes, M. W. (eds), Textual History of the Greek New Testament. Atlanta: Society of Biblical Literature, pp. 141–216.

Myers, G. (1999). A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46(3): 395–415.

O’Hara, R. and Robinson, P. (1993). Computer-assisted methods of stemmatic analysis. In Blake, N. F. and Robinson, P. (eds), The Canterbury Tales Project, Occasional Papers I. Oxford: Office for Humanities Communication, pp. 53–74.

Osburn, C. D. (2004). The Text of the Apostolos in Epiphanius of Salamis. Atlanta: Society of Biblical Literature.

Racine, J.-F. (2004). The Text of Matthew in the Writings of Basil of Caesarea. Atlanta: Society of Biblical Literature.

Robinson, P. M. W. (1992). Collate: A Program for Interactive Collation of Large Textual Traditions, Version 1.1. Computer program distributed by the Computers and Manuscripts Project. Oxford: Oxford University Computing Services.

Roos, T. and Heikkilä, T. (2009). Evaluating methods for computer-assisted stemmatology using artificial benchmark data sets. Literary and Linguistic Computing, 24: 417–33.

Royse, J. R. (2008). Scribal Habits in Early Greek New Testament Papyri. Leiden: Brill.

Sokal, R. R. and Michener, C. D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38: 1409–38.

Sorensen, T. (1948). A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Kongelige Danske Videnskabernes Selskab, 5(4): 1–34.

Thorpe, J. C. (2002). Multivariate statistical analysis for manuscript classification. TC: A Journal of Biblical Textual Criticism, 7. http://jbtc.org/v07/Thorpe2002.html.

Trovato, P. (2014). Everything You Always Wanted to Know about Lachmann’s Method: A Non-Standard Handbook of Genealogical Textual Criticism in the Age of Post-Structuralism, Cladistics, and Copy-Text. Padova: libreriauniversitaria.it.

Ukkonen, E. (1992). Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92: 191–211.

Wachtel, K. (2000). Colwell revisited: grouping New Testament manuscripts. In Amphoux, C.-B. and Elliott, J. K. (eds), The New Testament Text in Early Christianity: Proceedings from the Lille Colloquium, July 2000. Lausanne: Zèbre, pp. 31–43.

Wasserman, T. and Gurry, P. (2017). A New Approach to Textual Criticism: An Introduction to the Coherence-Based Genealogical Method. Atlanta: Society of Biblical Literature.

Footnotes

1

This is not the only goal of textual criticism. Narrative textual criticism, for instance, studies the variants themselves, their meaning, and their theological or social-historical significance (Lin, 2016, pp. 94–109).

2

All sequences of text where there is no variation at all among the MSS are discarded, since such sequences reveal nothing about the relationships between them (Wasserman and Gurry, 2017, p. 40).

3

See the useful account of the history and development of grouping NT MSS in Geer and Racine (2013).

4

This holds true only in relation to the Coherence-Based Genealogical Method (CBGM), since manuscript groups or groupings are a vital part of other computer-assisted stemmatological methods (O’Hara and Robinson, 1993, p. 58).

5

In CBGM, a text is separated from its physical carrier (i.e. the MS) and called a witness (Mink, 2004, p. 29).

6

See the Liste at http://ntvmr.uni-muenster.de/liste, where the latest numbers of surviving MSS can be found.

7

The most comprehensive analysis to date, the Editio Critica Maior, considers 183 MSS of Acts in their entirety (Wasserman and Gurry, 2017, p. 38).

8

Weighing and weighting of variants should not be confused here. The aim is not to assign discrete weights to variants but to weigh their significance; thus, the term ‘weighing of variants’ is used in this article.

9

Though CollateX can be seen as the successor to Peter Robinson’s software, Collate (Robinson, 1992), it is a completely new program.

10

See Mink (2004, pp. 27–8) concerning the problem of the extent of the places of variation.

11

These types of variation places are known in the phylogenetic context as parsimony informative sites (Lemey et al., 2009, p. 665).

12

The point here is to depict the way critics have conducted their analyses, without suggesting that any particular approach is right or wrong.

13

All the analyses were conducted on an iMac with a 2.7 GHz quad-core Intel Core i5 and 16 GB of memory.

14

The Python package polyleven employed here implements this algorithm: https://ceptord.net/20181215-polyleven.html.
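For illustration, a minimal call to that package could look as follows (assuming polyleven is installed; the strings are invented):

```python
# Sketch of polyleven usage; levenshtein() computes the Levenshtein
# distance using the fast bit-vector algorithm referenced above.
from polyleven import levenshtein

print(levenshtein("kitten", "sitting"))  # prints 3
```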

15

That is, duplicates are deleted within a set, not between sets.

16

In computer-assisted stemmatological approaches that use phylogenetic methods, one can use a technique called bootstrapping to test the statistical robustness of an analysis (see Hyytiäinen, 2021, pp. 17–8). This is not possible for the proposed method, which is a potential limitation.

17

As the length of a string decreases (and with it the size of its k-gram set), each change between strings counts for more: the numerator and the denominator of the coefficient both decrease, so a single difference shifts the proportional similarity value further. The sketch below illustrates this.
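A toy numerical sketch of this effect (invented sets, not manuscript data): substituting a single shared element lowers the coefficient far more when the sets are small.

```python
# One substituted element in sets of size 10 vs. size 1,000.
def sdc(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

small = {f"g{i}" for i in range(10)}
large = {f"g{i}" for i in range(1000)}

print(round(sdc(small, (small - {"g0"}) | {"x"}), 3))  # 0.9
print(round(sdc(large, (large - {"g0"}) | {"x"}), 3))  # 0.999
```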

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.