Sequence comparison in computational historical linguistics

With increasing amounts of digitally available data from all over the world, manual annotation of cognates in multi-lingual word lists becomes more and more time-consuming in historical linguistics. Using available software packages to pre-process the data prior to manual analysis can drastically speed up the process of cognate detection. Furthermore, it allows us to get a quick overview of data which have not yet been intensively studied by experts. LingPy is a Python library which provides a large arsenal of routines for sequence comparison in historical linguistics. With LingPy, linguists can not only automatically search for cognates in lexical data, but they can also align the automatically identified words, and output them in various forms, which aim at facilitating manual inspection. In this tutorial, we will briefly introduce the basic concepts behind the algorithms employed by LingPy and then illustrate in concrete workflows how automatic sequence comparison can be applied to multi-lingual word lists. The goal is to provide the readers with all information they need to (1) carry out cognate detection and alignment analyses in LingPy, (2) select the appropriate algorithms for the appropriate task, (3) evaluate how well automatic cognate detection algorithms perform compared to experts, and (4) export their data into various formats useful for additional analyses or data sharing. While basic knowledge of the Python language is useful for all analyses, our tutorial is structured in such a way that scholars with basic knowledge of computing can follow through all steps as well.


Introduction
Sequence comparison is one of the key tasks in historical linguistics. By comparing words or morphemes across languages, linguists can identify which words have sprung from a common source in genetically related languages, or which words have been borrowed from one language to another. By comparing words within a language, linguists can identify grammatical and lexical morphemes, cluster words into families, and shed light on the internal history of languages. So far the majority of this work has been carried out manually. Linguists sift through dictionaries and fieldwork notes, trying to identify those words which reflect a shared history across languages. All etymological dictionaries available today have been based on manual word comparison and their results fill thousands of pages. Even the largest databases which offer cognate judgments, such as the Austronesian Basic Vocabulary Database (ABVD, Greenhill et al., 2008) or the Indo-European Lexical Cognacy Database (Dunn, 2012), are based on manual assessments of cognacy.
With the increasing amounts of digitally available data it becomes harder for linguists to keep up. For example, the Sino-Tibetan Etymological and Thesaurus database (Matisoff, 2015) contains more than 500,000 words, but only a small number of words have been compared etymologically (see Hill and List, 2017: 64f). We need to take advantage of increasing amounts of data, refining work on well-established languages, and fostering work on the world's understudied languages. To do this, however, we will have to rethink the way we compare languages.
Historical linguists are skeptical about automating the methods for cognate identification (see Holman et al. (2011) and commentaries, as well as List et al. (2017b)). First, the accuracy of automated methods is often low, failing to reproduce the analyses of linguistic experts. In particular, the use of the edit distance (Levenshtein, 1965) has been criticized for being linguistically too naive, conflating sound correspondences and lexical replacement, to be useful for subgrouping or cognate detection (Campbell, 2011; Greenhill, 2011). Second, it is hard to verify many algorithms as they are seen as black boxes which hide the crucial decisions leading to cognate judgments and subgroupings, making it difficult for scholars to determine whether similarities are due to inheritance or contact (Jäger, 2015; List et al., 2017b). The nontransparency of automatic methods is highly problematic for computational historical linguistics: if we do not know what evidence decisions are based on, we cannot criticize and improve them.
However, methods for automatic sequence comparison in historical linguistics have dramatically improved during the last two decades. Starting with the pioneering work on pairwise and multiple phonetic alignment (Kondrak, 2000; Prokić et al., 2009), new methods for phonetic alignment and automatic cognate detection address both the problem of verification and the problem of accuracy (List et al., 2017b; Jäger et al., 2017). First, these algorithms are based on phonetically informed metrics of sound similarity. Importantly, any algorithmically identified correspondences are logged and can be inspected by researchers. Second, in a wide-ranging test, these methods have been found to be highly accurate, correctly identifying cognates in almost 90% of the cases (List et al., 2017b).
LingPy (List et al., 2017a) provides these algorithms as part of a stable open-source software package that works on all major platforms. Given the complexity of the problems involving sequence comparison in historical linguistics, computers will not be able to replace human judgments any time soon, but with the recent advancements, the methods are definitely good enough to provide substantial help for classical historical linguists to pre-analyze the data to be later corrected by experts, or to check the consistency of human cognate judgments. Over the long run, computational methods can also contribute to the bigger questions of language evolution, be it indirectly, by increasing the amount of digitally available high-quality annotated data, or directly, by giving scholars access to data too large to be processed by humans alone.
In the following, we will give a concise overview of how automatic sequence comparison can be carried out. After discussing general aspects of sequence comparison (Section 2), we will introduce basic ideas on the data needed (Section 3). We will then turn to the core tasks of automatic sequence comparison, namely automatic phonetic alignment (Section 4) and automatic cognate detection (Section 5). We conclude by showing how automatic approaches for cognate detection can be evaluated (Section 6), and how results can be exported to various formats (Section 7).
This article is supplemented by a detailed interactive tutorial in the form of an IPython Notebook (Pérez and Granger, 2007) which illustrates how all methods discussed here can be practically applied (see the Supplementary material for more information). Having installed the necessary software (Tutorial: 1), readers can follow the tutorial step by step and investigate how the algorithms work in practice. Our data is based on a small sample of Polynesian languages taken from the ABVD, which we substantially revised, both with respect to the phonetic transcriptions and the expert cognate judgments. All data needed to replicate the analyses discussed here are supplied. We give more information in the interactive tutorial (Tutorial: 2.1).

Basic aspects of sequence comparison
The words and morphemes which constitute a language are best modeled as sequences of sounds. Since sequences carry information not only in their elements (segments, whether these are phonemes, graphemes, or morphemes) but also in the order of these elements, a consistent comparison of sequences should account for both order and content. Alignments are a very general way to model differences between sequences. The major idea is to arrange two or more sequences in a matrix in such a way that similar or identical segments which occur in similar positions are placed in the same column of the matrix. If no counterpart for a segment can be found in one of the sequences, this is represented by a gap character, usually the dash symbol (List, 2014b).
Sequence alignments are crucial in biology, where they are used to compare protein and DNA sequences (Durbin et al., 2002). In historical linguistics, however, they are usually only implicitly employed, although initial attempts to arrange cognate words in a matrix go back to the early 20th century, as one can see from an early example based on Dixon and Kroeber (1919: 61) given in Fig. 1. The authors themselves describe this way of representing sequence similarities as a 'columnar form' with the goal to 'bring out parallelisms that otherwise might fail to impress without detailed analysis and discussion' (Dixon and Kroeber, 1919: 55). The figure further shows how the data would look if they were rendered in contemporary alignment editors for historical linguistics (List, 2017). Dixon and Kroeber's wording nicely expresses one of the major advantages of alignments: the transparency of homology assessments. Scholars often give long lists of cognate sets in the literature, claiming that all words are somehow related to each other, but if they do not list the alignments, it is often impossible, even for experts in the same language family, to understand where exactly the authors think that certain segments are similar.
Given that the inference of historically related words is not based on superficial word similarities but on recurrent systematic similarities, known as regular sound correspondences (Lass, 1997: 130), all judgments regarding the relatedness of words across languages directly rely on previously established sequence alignments (Fox, 1995: 67f). Alignment analyses not only increase the transparency of cognate judgments, but they also play a crucial role in substantiating these judgments in the first place. As can be seen from Table 1, similarities in cognate words in Sikaiana and Tahitian (data taken from Greenhill et al., 2008) are not based on the identity of sounds, but rather on the regularity of occurrence: whenever Sikaiana has a [k] and a [l], Tahitian has a [ʔ] and a [r], respectively. Without alignments, we could not identify this similarity. Alignments are also at the core of all automatic sequence comparison approaches in historical linguistics, as we will see throughout this tutorial.

Data preparation
When searching for cognates across languages, we usually assume that our data are given in some kind of wordlist, a list in which a number of concepts is translated into various languages. How many concepts we select depends on the research question, and various concept lists and questionnaires, ranging from 40 (Brown et al., 2008) up to more than 1,000 concepts (Haspelmath and Tadmor, 2009), have been proposed so far (see the overview in List et al. (2016a)). Our data example for this tutorial is based on the questionnaire of the ABVD project (Greenhill et al., 2008), consisting of 210 concepts, which were translated into 31 different Polynesian languages. For closely related languages, such as those in the Polynesian family, this gives us enough information to infer regular correspondences automatically, although it is clear that for analyses of more distant language relationship the number of words per language may not be enough.
Figure 1. Early alignment in 'columnar form' (Dixon and Kroeber, 1919), contrasted with a 'modern' representation using the EDICTOR tool (List, 2017).
The basic format used by LingPy is a tab-separated input file in which the first row serves as a header and defines the content of the rest of the rows. The very first column is reserved for numerical identifiers (which all need to be unique), while the order of the other columns is arbitrary, with specific columns being required, and others being optional. Essential columns which always must be provided are the language name (DOCULECT), the comparison concept (CONCEPT), the original transcription (International Phonetic Alphabet (IPA), FORM, or VALUE), and a space-segmented form of the transcription (TOKENS). Multiple synonyms for the same comparison concept in the same language should be written in separate rows and given a separate ID each. The data in the TOKENS column should supply the transcriptions in space-segmented form, that is, instead of transcribing the Fila word for 'all' as [eutʃi], the software expects [e u tʃ i], which is internally interpreted as a sequence of four segments, namely [e], [u], [tʃ] and [i], with [tʃ] representing a voiceless post-alveolar affricate. If the TOKENS are not supplied to the algorithm, it will try to segment the data automatically, provided it can find the column IPA, which is otherwise not necessarily required to appear in the data. This, however, may lead to various problems and unexpected behavior. We therefore urge all users of LingPy to make sure that they supply segmented data to the algorithm, and that they adhere to the general standards of transcription as they are represented in the IPA (IPA, 1999).¹ The format can be created manually by using either a text editor or a spreadsheet program that allows exporting to tab-separated format. To a large degree, this input format is compatible with the one advocated by the Cross-Linguistic Data Formats (CLDF) initiative (Forkel et al., 2017), the main difference being that LingPy requires a flat single file with tab stops as separators, while CLDF supports multiple files. CLDF furthermore encourages the use of reference catalogs, such as Glottolog (Hammarström et al., 2017) or Concepticon (List et al., 2018), in order to increase the comparability of linguistic data across datasets, while LingPy is indifferent regarding the overall comparability as long as the data is internally consistent. As of version 2.6, LingPy offers routines to convert to and from CLDF (see Tutorial: 6.3). Figure 2 provides a basic summary of LingPy's input formats. More information on the format, and how it can be loaded into LingPy, can be found in the supplemented interactive tutorial (Tutorial: 2.2-3).
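The format requirements described above can be checked before any analysis is run. The following is a minimal sketch in plain Python, not LingPy's own loader: the function name `read_wordlist` and the sample row are illustrative assumptions, and the checks only cover what the text states (unique numerical identifiers in the first column, the required columns, one transcription column, and space-segmented TOKENS).

```python
import csv
import io

# Column names follow the text; the check itself is a hypothetical helper.
REQUIRED = {"DOCULECT", "CONCEPT", "TOKENS"}
TRANSCRIPTION = {"IPA", "FORM", "VALUE"}


def read_wordlist(handle):
    """Read a tab-separated wordlist and run basic format checks."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.strip().upper() for name in next(reader)]
    columns = set(header[1:])
    missing = REQUIRED - columns
    if missing:
        raise ValueError("missing columns: " + ", ".join(sorted(missing)))
    if not TRANSCRIPTION & columns:
        raise ValueError("need one of IPA, FORM, or VALUE")
    rows = {}
    for line in reader:
        if not line:
            continue  # skip empty lines
        idx = int(line[0])  # first column holds the numerical identifier
        if idx in rows:
            raise ValueError("duplicate ID: {0}".format(idx))
        entry = dict(zip(header[1:], line[1:]))
        entry["TOKENS"] = entry["TOKENS"].split()  # space-segmented form
        rows[idx] = entry
    return rows


# One-row sample built from the Fila example in the text.
SAMPLE = (
    "ID\tDOCULECT\tCONCEPT\tIPA\tTOKENS\n"
    "1\tFila\tall\teutʃi\te u tʃ i\n"
)
wordlist = read_wordlist(io.StringIO(SAMPLE))
print(wordlist[1]["TOKENS"])
```

Splitting TOKENS on whitespace keeps the affricate [tʃ] as one segment, which is the reason the text asks for explicit segmentation rather than character-by-character splitting.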
Data quality and consistency play a crucial role in the outcome of an automatic sequence comparison. As a general rule of thumb, we recommend that all linguists who apply LingPy or other software to carry out automatic sequence comparison pay careful attention to what we call the SANE rules for data sanity: users should pay close attention to providing a sensible segmentation of their data, they should aim for high coverage, there should be no mixing of data from different sources (as this usually leads to inconsistent transcriptions and may also increase the number of synonyms), and synonyms should be evaded.² These rules are summarized in Table 2. If the original data does not provide reliable phonetic transcriptions, as was the case with the Polynesian data we use in this tutorial, orthography profiles (Moran and Cysouw, 2017) provide an easy way to refine transcriptions while at the same time segmenting the data, and the EDICTOR tool (List, 2017) offers convenient ways to check phonological inventories of all varieties (Tutorial: 2.4). Various coverage statistics can be computed in LingPy (see Tutorial: 2.5). Synonym statistics can also be easily computed (see Tutorial: 2.6). Users should always keep in mind that the quality of automatic sequence comparison crucially depends on the quality of the data submitted to the algorithms.
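To make the coverage and synonym checks concrete, the following sketch computes two simple statistics in plain Python. It is an illustration under stated assumptions rather than LingPy's implementation: the toy rows and the doculect names A, B, and C are invented, mutual coverage is taken here to be the number of concepts shared by a language pair, and synonymy is taken to be the average number of words per attested language-concept slot.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: one (doculect, concept) pair per wordlist row.
ROWS = [
    ("A", "hand"), ("A", "neck"), ("A", "sea"), ("A", "sea"),
    ("B", "hand"), ("B", "neck"),
    ("C", "hand"), ("C", "sea"),
]


def concepts_per_language(rows):
    """Map each doculect to the set of concepts it has words for."""
    coverage = defaultdict(set)
    for doculect, concept in rows:
        coverage[doculect].add(concept)
    return coverage


def mutual_coverage(rows):
    """Number of shared concepts for every pair of doculects."""
    coverage = concepts_per_language(rows)
    return {
        (a, b): len(coverage[a] & coverage[b])
        for a, b in combinations(sorted(coverage), 2)
    }


def synonymy(rows):
    """Average number of words per attested doculect-concept slot."""
    counts = defaultdict(int)
    for slot in rows:
        counts[slot] += 1
    return sum(counts.values()) / len(counts)


print(mutual_coverage(ROWS))
print(round(synonymy(ROWS), 2))
```

A synonymy value close to 1 means that most slots are filled by exactly one word; low mutual coverage for a language pair signals that there may be too little shared material to find regular correspondences.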

Automatic phonetic alignment
Alignments are crucial for historical language comparison: they can be used to search for regular sound correspondence patterns or layers of borrowed words, or even serve as the starting point for linguistic reconstruction (Fox, 1995). A further important advantage is that they can be easily quantified, as we will see in Section 5. Since phonetic alignment is heavily influenced by bioinformatics, linguists using phonetic alignments should have some basic understanding of the original algorithms and terminology. In this context, it is not necessarily important to understand how the algorithms work in detail. Instead, we think it is more important to learn (also by testing the algorithms with different data and parameters) how the different options from which users can choose influence the results. In the following, we will quickly introduce basic algorithms and concepts involving alignments in historical linguistics, and how they relate to alignments in bioinformatics. We will follow the traditional division into pairwise and multiple alignments (which results from the differences in complexity of the algorithms), and introduce the most important concepts and parameters that users should know when applying the methods.

Pairwise alignment analyses
Pairwise alignment analyses in biology and computer science date back to the 1970s when scholars like Needleman and Wunsch (1970), and Wagner and Fischer (1974) proposed algorithms based on the dynamic programming paradigm (Eddy, 2004b) which drastically reduced the computation time for the task of aligning two sequences with each other. The basic idea of the algorithms by Needleman and Wunsch and Wagner and Fischer was to split the problem of finding one optimal alignment between two sequences into subparts and to build the general solution from optimal alignments of smaller subsequences (Durbin et al., 2002: 19).³ The major parameters of pairwise alignment algorithms are the scoring function, the gap function, and the alignment mode. The scoring function (Fig. 3A, Tutorial: 3.1.1) determines how the matching of segments is penalized (or favored). In biology, it is well known that amino acid mutations follow certain transition preferences. The scoring function defines transition probabilities for each segment pair, and biologists make use of a large number of empirically derived scoring functions (Eddy, 2004a). In linguistics, on the other hand, we know well that certain sounds are more likely to occur in correspondence relations with each other (Dolgopolsky, 1964; Brown et al., 2013), and this knowledge can be used as a proxy when designing a scoring function in linguistics. While biology deals with small alphabets, in linguistics, the number of possible sounds in the languages of the world amounts to the thousands (Moran et al., 2014). It is not practical to design a matrix containing and confronting all sounds with each other, and most algorithms reduce the size of the alphabet by lumping similar sounds into a set of predefined sound classes (Fig. 3B, Tutorial: 3.1.2), for which transition probabilities can be efficiently defined, and which are then given as input for the alignment algorithm (List, 2012a; Holman et al., 2008). The introduction of gaps in an alignment (Fig. 3C, Tutorial: 3.1.3) can be seen as a special case of a scoring function. Instead of comparing two segments, the algorithm checks whether the introduction of a gap might be preferable. While gaps were originally given the same penalty, independent of the element with which they were compared, later studies showed that they could even be individually adjusted for each position in a sequence (Thompson et al., 1994). In linguistics, we know that sounds in certain positions (like initial consonants) are less likely to be lost and that new sounds tend to appear in specific contexts as well. In LingPy, position-specific gap penalties are derived from the prosodic profiles of sequences (List, 2012a). Prosodic profiles essentially reflect for each segment of a word whether it occurs in weak or strong prosodic positions, and the user-defined gap penalty is modified accordingly.
Table 2. The SANE rules for data sanity.
Segment the data. Example: Fila [e u tʃ i] 'all'.
Aim for high coverage. Each language should have about the same number of words recorded across the wordlist. A high mutual coverage is important to allow algorithms to find enough information to determine the major signal.
No mixing of data from different sources. Mixing data for the same language from various sources can lead to inconsistencies in the phonetic representation of words, even if they are all given in plain phonetic transcriptions. This will weaken the evidence for regular sound correspondences.
Evade synonyms. Languages often have multiple words for a given meaning. However, these can cause problems for sequence comparison and further downstream analyses like phylogenetic reconstruction. Having abundant synonyms in the data (e.g. 40 words for snow) will necessarily blur this signal. NOT: Tahitian 'sea'.
The alignment mode (Fig. 3D, Tutorial: 3.1.4) basically determines which parts of individual sequences are compared. It is often impossible to compare two words entirely. Instead, we compare only those parts which we know to be cognate, ignoring the parts which we know are not. Since the same problem occurs when comparing the genes of diverse species in bioinformatics, biologists have long been working on solutions, reflected in local alignment analyses (Smith and Waterman, 1981), in which only the most similar parts of sequences are compared (see Fig. 3) while the rest is ignored, or in semi-global alignments (Durbin et al., 2002: 26f).
What should users keep in mind when carrying out pairwise alignment analyses? As a rule of thumb, we recommend caution with local alignment analyses, since these can show unexpected behavior. We also recommend care with custom changes applied to the scoring or the gap function. Users often naively assume that simply 'telling' the computer which sound changes to expect will automatically lead to excellent alignments, and at times complain that LingPy's standard algorithms fail to 'detect certain obvious changes'. However, alignments are not a way to determine sound changes: they are at best a first step toward linguistic reconstruction, and none of the algorithms which have been proposed so far models any kind of change. What is modeled instead are correspondences of sounds. It is difficult, if not impossible, to design an algorithm that aligns sequences of all kinds of diversity without proposing certain analyses which look awkward to a trained linguist. But remember, automatic sequence comparison is not there to replace the experts, but to help them.
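How the scoring function, sound classes, and the gap penalty interact in a global pairwise alignment can be illustrated with a small dynamic-programming sketch. Several assumptions apply: the three sound classes (H, K, V), the scores (1, 0, -1), and the constant gap penalty are invented for illustration and are much cruder than the sound-class models and position-specific gap penalties used in LingPy. The segmentations of Hawaiian [ʔa:ʔi:] and Mangareva [kaki] 'neck' are likewise assumed here.

```python
# Toy sound classes (hypothetical, NOT LingPy's SCA model).
TOY_CLASSES = {
    "ʔ": "H", "h": "H",
    "k": "K", "g": "K",
    "a": "V", "a:": "V", "i": "V", "i:": "V", "u": "V", "o": "V",
}


def class_score(seg_a, seg_b):
    """Score two segments by comparing their sound classes."""
    cls_a, cls_b = TOY_CLASSES[seg_a], TOY_CLASSES[seg_b]
    if cls_a == cls_b:
        return 1   # same class
    if cls_a != "V" and cls_b != "V":
        return 0   # two different consonant classes
    return -1      # vowel against consonant


def needleman_wunsch(seq_a, seq_b, score, gap=-1):
    """Global alignment of two segment lists with a constant gap penalty."""
    n, m = len(seq_a), len(seq_b)
    # matrix[i][j]: best score for aligning seq_a[:i] with seq_b[:j]
    matrix = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        matrix[i][0] = i * gap
    for j in range(1, m + 1):
        matrix[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            matrix[i][j] = max(
                matrix[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                matrix[i - 1][j] + gap,
                matrix[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    alm_a, alm_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and matrix[i][j] ==
                matrix[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            alm_a.append(seq_a[i - 1])
            alm_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and matrix[i][j] == matrix[i - 1][j] + gap:
            alm_a.append(seq_a[i - 1])
            alm_b.append("-")
            i -= 1
        else:
            alm_a.append("-")
            alm_b.append(seq_b[j - 1])
            j -= 1
    return alm_a[::-1], alm_b[::-1], matrix[n][m]


hawaiian = ["ʔ", "a:", "ʔ", "i:"]
mangareva = ["k", "a", "k", "i"]
alm_a, alm_b, total = needleman_wunsch(hawaiian, mangareva, class_score)
print(" ".join(alm_a))
print(" ".join(alm_b))
print(total)
```

Because vowels of different length fall into the same toy class, the two words align segment by segment without gaps, although they do not share a single identical segment. Changing `gap` or the values returned by `class_score` is a quick way to see how strongly the parameters influence the result.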

Multiple alignment analyses in linguistics
Pairwise alignments are crucial for most automatic cognate detection methods (List, 2014b; Jäger et al., 2017). In order to visualize cognate judgments, or to reconstruct proto-forms, however, pairwise alignments are not of great help, as most linguistic research involves at least three language varieties. It may sound counterintuitive for readers not familiar with the major workflows for automatic cognate detection that pairwise alignments are mainly used to detect cognates across multiple languages, while multiple alignments are only later computed from existing cognate sets. Why not compute multiple alignments right from the beginning, as, for example, proposed by Wheeler and Whiteley (2015)? The reason for this workflow is that alignments only make sense when representing cognate words: aligning unrelated words just leads to chance similarities.
For reasons of algorithmic complexity, pairwise alignment algorithms cannot simply be rewritten to account for an arbitrary number of sequences. In order to address this problem, early approaches used heuristics that approximate optimal multiple alignments (Feng and Doolittle, 1987; Thompson et al., 1994). Most of these algorithms compute pairwise alignments in a first step and then combine the data in a pairwise fashion until all alignments are merged into one multiple alignment. The easiest way to do so is with the help of a guide tree, a clustering of all sequences, which determines in which order sequences are merged with each other. This procedure is illustrated in Fig. 4 for the alignment of four words for 'dog' in four Polynesian languages (Tutorial: 3.2).
Many extensions of the classical guide-tree heuristics have been proposed in the biological literature (Notredame et al., 2000; Morgenstern et al., 1998) and also adapted in linguistic applications (List, 2012a; Jäger and List, 2015; Hruschka et al., 2015). While the fine-tuning of the algorithms may have a solid impact on multiple alignment analyses involving large sets of language varieties, as we often encounter in dialectology (compare the results of Prokić et al., 2009 with List, 2012a), the problem of erroneous alignments is much less pronounced when using smaller datasets and working in workflows which start from cognate detection and compute multiple alignments at a later stage. For these reasons, we refrain from giving more detailed descriptions of multiple sequence alignment here, but instead refer the readers to the literature that we quoted in this section and the examples in the interactive tutorial (Tutorial: 3.2).

Automatic cognate detection
As mentioned in the previous section, we can only meaningfully align words if we know they are historically related. In order to identify which words are related, however, we still need to compare them, and most automatic approaches, including the core methods available in LingPy, make use of pairwise sequence comparison techniques in order to find historically related words in linguistic datasets.
The basic workflow of most automatic cognate detection methods can be divided into two major steps. In the first step, pairwise alignment is used to align all words to retrieve distance scores for each pair of words in the data which occur in the same concept slot. If normalized, distance scores typically range between 0 and 1, with 0 indicating the identity of the objects under comparison, and 1 indicating the maximal difference that can be encountered for the objects. In a second step, these distances are used to partition the words into presumed cognate sets using tree- or network-based partitioning algorithms. If we take five words for 'neck' from our Polynesian data, Ra'ivavae [ʔagapoʔa], Hawaiian [ʔa:ʔi:], Mangareva [kaki], Maori [ua], and Rapanui [ŋao], for example, we can use the normalized edit distance (NED) to compare all five words with each other and write the results into a matrix, as shown in Fig. 5A.⁴ In Fig. 5B, we have carried out the same pairwise comparison, but this time with a different sequence comparison measure, following the sound-class-based alignment method (SCA, List 2012a), in which the idea of sound classes is combined with sequence alignment methods. Fig. 5C shows the results retrieved from the LexStat method (List, 2012b), which derives distances from a previous search for regular sound correspondences. As can be seen, when comparing only the matrices, the methods generally differ in the way they handle sequence similarities. While NED has rather high scores which do not vary much from each other, SCA has consistently smaller scores with more variation, and LexStat has higher scores but more variation than NED.
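The first step of the workflow, computing pairwise distances within a concept slot, can be sketched for the normalized edit distance in a few lines. This is a plain-Python illustration, not LingPy's implementation; the normalization by the length of the longer word and the segmentations of the four 'neck' words used below are assumptions made here, so the values need not match those in the figure.

```python
def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two lists of segments."""
    previous = list(range(len(seq_b) + 1))
    for i, seg_a in enumerate(seq_a, 1):
        current = [i]
        for j, seg_b in enumerate(seq_b, 1):
            current.append(min(
                previous[j] + 1,                     # deletion
                current[j - 1] + 1,                  # insertion
                previous[j - 1] + (seg_a != seg_b),  # (mis)match
            ))
        previous = current
    return previous[-1]


def ned(seq_a, seq_b):
    """Edit distance divided by the length of the longer sequence."""
    if not seq_a and not seq_b:
        return 0.0
    return edit_distance(seq_a, seq_b) / max(len(seq_a), len(seq_b))


def distance_matrix(words, distance):
    """Symmetric matrix of pairwise distances for one concept slot."""
    return [[distance(a, b) for b in words] for a in words]


# Assumed segmentations of four of the words for 'neck' cited in the text.
NECK = {
    "Hawaiian": ["ʔ", "a:", "ʔ", "i:"],
    "Mangareva": ["k", "a", "k", "i"],
    "Maori": ["u", "a"],
    "Rapanui": ["ŋ", "a", "o"],
}
matrix = distance_matrix(list(NECK.values()), ned)
for name, row in zip(NECK, matrix):
    print(name, [round(value, 2) for value in row])
```

Under these segmentations the cognate pair Hawaiian and Mangareva receives the maximal distance, since no segment is identical. This illustrates why a measure that only knows identity and difference tends to yield uniformly high scores.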
In the second step, the matrix of word pair distances is used to partition the words into cognate sets. For this, partitioning algorithms are used which split the words into cognate sets by trying to account as closely as possible for the pairwise distances of all words in a given meaning slot. Early approaches were based on a flat version of the well-known UPGMA algorithm (Sokal and Michener, 1958), which is an agglomerative cluster algorithm that returns the data in the form of a tree. The flat variant of UPGMA stops merging words into bigger subgroups once a user-defined threshold of average pairwise distances among the words in each cluster has been reached (List, 2012b). In order to show how algorithms arrive from pairwise distance scores in a matrix at cognate set partitions, we provide a concrete example in Fig. 5. First, we have marked all cells in which the distance is smaller than the recommended threshold for each method (following List et al., 2017b).⁵ Second, we added guide trees (reflecting the clustering proposed when applying the UPGMA algorithm without stopping it earlier) below each matrix, which show how the flat clustering algorithm proceeds. If the algorithm stops grouping words into a given cluster because the average threshold has been reached, this is marked by a dashed line, which indicates how the clustering would have proceeded if the algorithm had not stopped. Given that we know that of the five words in the figure, only Hawaiian [ʔa:ʔi:] and Mangareva [kaki] are cognate, we can immediately see that the LexStat algorithm is proposing the correct cognates in this example.
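The flat clustering step can be sketched as follows. This is a simplified reading of the procedure described above, not LingPy's code: clusters are merged in order of their average pairwise distance, and merging stops as soon as the smallest average reaches the threshold. The distance matrix and the threshold of 0.5 are invented for illustration.

```python
def flat_upgma(matrix, threshold):
    """Agglomerative average-linkage clustering that stops at a threshold.

    Returns a list of clusters, each a list of row indices of the matrix.
    """
    clusters = [[i] for i in range(len(matrix))]

    def average(cluster_a, cluster_b):
        total = sum(matrix[i][j] for i in cluster_a for j in cluster_b)
        return total / (len(cluster_a) * len(cluster_b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        best, (x, y) = min(
            (average(a, b), (x, y))
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if best >= threshold:
            break  # no remaining pair is similar enough to be merged
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical distances between four words in one concept slot.
TOY_MATRIX = [
    [0.0, 0.2, 0.8, 0.9],
    [0.2, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
print(flat_upgma(TOY_MATRIX, 0.5))
```

With a threshold of 1.0 or higher, the loop would continue until a single cluster remains, which corresponds to the full guide tree; lowering the threshold cuts the tree earlier and yields more, smaller cognate sets.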
The performance of LexStat is not surprising if we take its more sophisticated working procedure into account. LexStat uses global and local pairwise alignments to pre-analyze the data, computing language-specific scoring functions (List, 2012b), in which the similarity of the segments in a given language pair depends on the overall number of matches that could be found in the preprocessing stage.⁶ In these scoring functions, sound segments for all languages in the data are represented as sound-class strings in a certain prosodic environment. This representation is useful to handle sound correspondences in different contexts (word-initial, word-final, etc.). For each language pair in the data, LexStat creates an attested and an expected distribution of sound correspondences. The attested distribution is computed for words with the same meaning and whose SCA score is beyond a user-defined threshold. The expected distribution is computed by shuffling the word lists in such a way that words with different meanings are aligned and compared, with the users defining how often word lists should be shuffled. This permutation test, following suggestions by Kessler (2001), makes sure that the sound correspondences identified are unlikely to have arisen by chance. The distributions resulting from this permutation test are then combined in log-odds scores (see Fig. 3 above) which can then in turn be used to realign all words and determine their LexStat distance.⁷ These scores are then again used to create a matrix of pairwise distances as shown in Fig. 5. Our interactive tutorial shows how input data can be quickly checked before carrying out the (at times time-consuming) computation (Tutorial: 4.1), provides additional information regarding the differences between the cognate detection methods available in LingPy (Tutorial: 4.2), and illustrates in detail how each of them can be applied (Tutorial: 4.3).
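The idea of contrasting an attested with an expected distribution can be illustrated with a deliberately simplified sketch. It is not LexStat's formula: the two toy languages are invented, segments are paired by position instead of being aligned, no SCA threshold or prosodic context is used, and instead of random shuffling the sketch enumerates all word pairs with different meanings, which makes the result deterministic. The smoothing constant `eps` is likewise an assumption.

```python
import math
from collections import Counter

# Hypothetical toy wordlists: concept -> segments, with equal length per
# concept so that positions can be paired without an alignment step.
LANG_A = {"c1": ["k", "a"], "c2": ["l", "a"], "c3": ["k", "i"]}
LANG_B = {"c1": ["ʔ", "a"], "c2": ["r", "a"], "c3": ["ʔ", "i"]}


def relative_frequencies(word_pairs):
    """Relative frequency of each segment pair over all word pairs."""
    counts = Counter()
    for word_a, word_b in word_pairs:
        counts.update(zip(word_a, word_b))
    total = sum(counts.values())
    return {pair: number / total for pair, number in counts.items()}


def log_odds(lang_a, lang_b, eps=0.01):
    """Log-odds of attested versus expected segment correspondences."""
    concepts = sorted(set(lang_a) & set(lang_b))
    # Attested: words with the same meaning.
    attested = relative_frequencies(
        (lang_a[c], lang_b[c]) for c in concepts)
    # Expected: words with different meanings (stand-in for shuffling).
    expected = relative_frequencies(
        (lang_a[c], lang_b[d]) for c in concepts for d in concepts if c != d)
    return {
        pair: math.log2(
            (attested.get(pair, 0) + eps) / (expected.get(pair, 0) + eps))
        for pair in set(attested) | set(expected)
    }


scores = log_odds(LANG_A, LANG_B)
for pair in sorted(scores, key=scores.get, reverse=True):
    print(pair, round(scores[pair], 2))
```

Pairs that co-occur more often in words with the same meaning than in words with different meanings, such as [k] and [ʔ] in the toy data, receive positive scores, while pairs that only arise when meanings are mismatched receive negative ones.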
More recent approaches for cognate set partitioning use Infomap (Rosvall and Bergstrom, 2008), a community detection algorithm which uses random walks in a graph representation of the data to identify those clusters in which significantly more edges can be found inside a group than outside (Newman, 2006). In order to model the data as a graph, words are represented as nodes and distances between words are represented as edges which are drawn between all nodes whose pairwise distance is below a user-defined threshold (List et al., 2017b). Recent studies have shown that the graph-based partitioning approaches slightly outperform the flat agglomerative clustering procedures (List et al., 2016b, 2017b; Jäger et al., 2017).
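The graph representation itself is easy to sketch, even though Infomap is not. The following plain-Python example builds the graph from a distance matrix and then simply returns its connected components. This is a much cruder partitioning than Infomap and is only meant to show how words and distances become nodes and edges; the matrices and the threshold are invented.

```python
def graph_partition(matrix, threshold):
    """Connect words whose distance is below the threshold and return
    the connected components (a simple stand-in, NOT Infomap)."""
    size = len(matrix)
    neighbors = {
        i: [j for j in range(size) if j != i and matrix[i][j] < threshold]
        for i in range(size)
    }
    seen, components = set(), []
    for start in range(size):
        if start in seen:
            continue
        # Depth-first search collecting all nodes reachable from start.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(component))
    return components


# Hypothetical distances between four words in one concept slot.
TOY_GRAPH_MATRIX = [
    [0.0, 0.2, 0.8, 0.9],
    [0.2, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
print(graph_partition(TOY_GRAPH_MATRIX, 0.5))

# A chain: word 0 is close to word 1, word 1 to word 2, but 0 and 2 are not.
CHAIN = [
    [0.0, 0.3, 0.9],
    [0.3, 0.0, 0.3],
    [0.9, 0.3, 0.0],
]
print(graph_partition(CHAIN, 0.5))
```

The chain example shows the weakness of plain connected components: all three words end up in one set although two of them are very distant. Community detection algorithms such as Infomap weigh the density of edges inside and outside a group and can therefore split such chains.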
The advantage of LexStat and similar algorithms is that the algorithm infers a lot of information from the data itself.Instead of assuming language-independent distance scores which would be the same for all languages in the world, it essentially infers potential sound correspondences for each language pair in separation and uses this information to determine language-specific distance scores.The disadvantages of LexStat are the computation time and the dependency of data with high mutual coverage.It was designed in such a way that it refuses to cluster words into cognate sets if sufficient information is lacking.As a rule of thumb, derived from earlier studies (List, 2014a), we recommend applying LexStat only if the basic concept lists of a given dataset consists of at least 200 words, and if the mutual coverage of the data exceeds 150 word pairs.If the data is too sparse, such as, for example, in the ASJP database (Wichmann et al., 2016) which gives maximally 40 concepts per language, we recommend to use either the SCA approach, or to turn to more sophisticated machine learning approaches (Ja ¨ger et al., 2017), which have been designed and trained in such a way that they yield their best scores on smaller datasets.In all cases, users should be aware that the algorithms may fail to detect certain cognates.The reasons range from rare sound correspondences which can trigger problematic alignments, via sparseness of data (especially when dealing with divergent languages), up to problems of morphological change which may easily confuse the algorithms as they may yield partial cognates and produce words that cannot be fully aligned anymore (List et al., 2017b).In Table 3, we summarize some basic differences between the four methods mentioned so far.
Once the words have been clustered into cognate sets, it is advisable to align all cognate words with each other, using a multiple alignment algorithm (Tutorial: 4.4). Alignments are useful in multiple ways. First, users can easily inspect them with web-based tools (Tutorial: 4.5). Second, they can be used to statistically investigate the identified sound correspondence patterns in the data (see Tutorial: 4.6). Both the manual and the automatic check of the results provided by automatic cognate detection methods are essential for a successful application of the methods. Only in this way can users either convince themselves that the results come close to their expectations or discover that something has gone wrong. In the latter situation, we recommend that users thoroughly check to what degree they have conformed to our SANE rules for dataset sanity outlined above in Section 3. We also recommend that users do not change the different parameters too much, especially when applying LingPy for the first time. Instead of trying to fix minor errors (such as obvious cognates missed or lookalikes marked as cognates) by changing parameters, it is often more efficient to correct errors manually. Although Rama et al. (2018) report promising results on fully automated workflows, we do not recommend relying entirely on automatic cognate detection when it comes to phylogenetic reconstruction, since the algorithms tend to be too conservative, often missing valid cognates (List et al., 2017b). We are, however, confident enough to recommend it for initial data exploration and for the preparsing of data in order to increase the efficiency of cognate annotation.

Evaluation
We have claimed above that automatic cognate detection has made great progress of late. We make this claim based on tests in which the performance of automatic cognate detection algorithms was compared with expert cognate judgments (List et al., 2017b). There are different ways to compare expert cognate judgments with algorithmic ones. A very simple but nevertheless important one is to compare different cognate judgments manually, by eyeballing the data. Even if one lacks expert cognate judgments for a given dataset, this may be useful, as it helps to get a quick impression of potential weaknesses of the algorithm used for a given analysis. Comparing concrete cognate judgments, however, can be quite tedious, especially if the data are not presented in any ordered fashion. For this reason, LingPy offers a specific format that helps to compare different cognate judgments in a rather convenient way. How this comparison can be carried out is illustrated in Table 4, where we use the numeric annotation for cognate clusters as described in Fig. 6 to compare expert cognate judgments for 'to turn' in eight East Polynesian languages with those produced by normalized edit distance (NED), the SCA, and the LexStat method, respectively. As can be seen from the table, NED lumps all words into one cluster, apparently confused by the similarity of the vowels across all words. SCA comes close to the expert annotation, but wrongly separates Hawaiian [wili] from the first cluster, apparently confused by the dissimilarity of the sound classes. LexStat correctly identifies all cognates, apparently thanks to its initial search for language-specific similarities between sound classes. In the interactive tutorial, we show how users can compute similar overviews of differences in cognate detection analyses and conveniently compare them (Tutorial: 5.1).
While manual inspection is important, it is also crucial to have an independent and objective score that tells us how well algorithms perform on a given dataset. Knowing the approximate performance may, for example, be useful when working with large datasets which would take too long to be analyzed manually. If we annotate part of the data and see that the automatic methods perform well enough, we could then use the automatic approaches to carry out our analyses and report the expected accuracy in the study. Our recommended evaluation measures are B-Cubed scores (Bagga and Baldwin, 1998; Amigó et al., 2009), which Hauer and Kondrak (2011) first introduced as a measure to assess the quality of cognate detection algorithms compared to expert judgments.
The computation of B-Cubed scores is explained in detail elsewhere (List et al., 2017b), and it would go beyond the scope of this tutorial to introduce it here again. For users interested in automatic cognate detection, but reluctant to learn in depth about evaluation measures in computational linguistics, it is sufficient to know how the B-Cubed scores should be interpreted. Usually the scores are given in three forms, which all range between 0 and 1: precision, recall, and F-Score. Precision reflects how many of the cognate relations proposed by the algorithm are also found in the expert judgments, so low precision points to many false positives. Recall, accordingly, reflects how many of the cognate relations identified by the experts are recovered by the algorithm, so low recall points to many missed cognates (false negatives). The F-Score, the harmonic mean of precision and recall, can be seen as a general summary of the two, derived by the formula $F = 2 \times \frac{P \times R}{P + R}$, where $P$ is the precision and $R$ is the recall. High scores mean that the algorithms come close to the judgment of the experts. A score of 1.0 in precision and recall (and therefore also in the F-Score) means that the results are 100% identical.
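For readers who prefer code to prose, the following sketch computes B-Cubed precision, recall, and F-Score for two clusterings of the same words, following the standard definition of the measure. It is an illustration in plain Python with invented judgments and a hypothetical function name, not the routine shipped with LingPy. For each word, the overlap between its expert cluster and its automatic cluster is divided by the size of the automatic cluster (precision) or of the expert cluster (recall), and the values are averaged over all words.

```python
def bcubed(gold, test):
    """Return B-Cubed precision, recall, and F-score for two clusterings.

    Both arguments map each word ID to a cluster label.
    """
    def members(labels):
        # Map each item to the full set of items sharing its cluster label.
        clusters = {}
        for item, label in labels.items():
            clusters.setdefault(label, set()).add(item)
        return {item: clusters[label] for item, label in labels.items()}

    gold_sets, test_sets = members(gold), members(test)
    items = list(gold)
    precision = sum(
        len(gold_sets[i] & test_sets[i]) / len(test_sets[i]) for i in items
    ) / len(items)
    recall = sum(
        len(gold_sets[i] & test_sets[i]) / len(gold_sets[i]) for i in items
    ) / len(items)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical judgments for five words: the experts see two cognate sets,
# while the automatic method lumps all words into a single set.
expert = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}
lumper = {1: "x", 2: "x", 3: "x", 4: "x", 5: "x"}
precision, recall, f_score = bcubed(expert, lumper)
```

In this invented case the lumping method reaches a recall of 1.0, since no expert cognate is missed, but its precision drops to 0.52 because many of the proposed cognate relations are false positives.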
In Table 5, we report the results achieved by four automatic cognate detection methods on a small subset of ten East Polynesian languages which we retrieved from our Polynesian dataset for illustrative purposes. In addition to the three methods reported already in Table 4, we added a random cognate detector which was sampled from 100 trials, and the Infomap version of the LexStat algorithm (LS-Infomap), in which the cognate set partitioning is carried out with the Infomap algorithm instead of the flat version of UPGMA (see Section 5 above). NED shows a rather low precision compared to the other nonrandom approaches, indicating that it proposes many false positives (as we could see above in Table 4). On the other hand, its recall is very high, indicating that it does not miss many cognate sets. SCA has a lot of problems with the data, performing worse than NED in general, with a rather low precision and recall. Both LexStat approaches largely outperform the other approaches in general, and their very high precision is especially reassuring, since it indicates that the algorithms do not propose too many false positives. That the Infomap version of LexStat performs better than LexStat with UPGMA is also shown in this comparison, although the differences are much smaller than reported in List et al. (2017b). It would be very interesting to compare the scores we achieved with general scores of levels of agreement among human experts. Unfortunately, no systematic study has been carried out so far.
The interactive tutorial gives a detailed introduction to the computation of B-Cubed scores with LingPy (Tutorial: 5.2). Given the differences in the results regarding precision, recall, and generalized F-scores, it is obvious that the choice of algorithm depends on the task at hand. If users plan to invest much time in manual data correction, an algorithm with high recall, which identifies most of the cognates in the data while proposing a few erroneous ones, is probably the best choice. Users can achieve this by choosing a high threshold or an algorithm such as NED, which yields a rather high B-Cubed recall, at least for the Polynesian data in our sample. In other cases, however, when user correction is not feasible because of the size of the dataset, it is useful to choose low thresholds or generally conservative algorithms with high B-Cubed precision in order to minimize the number of false positives.

Data export
LingPy provides direct export of the cognate judgments to the Nexus format (Maddison et al., 1997), allowing users to analyze automated cognate judgments with popular packages for phylogenetic reconstruction, such as SplitsTree (Huson, 1998), MrBayes (Ronquist et al., 2009), or BEAST 2 (Bouckaert et al., 2014; see Tutorial: 6.1). If phylogenetic trees are computed from distance matrices, both matrices and trees can be written to file and further imported into software packages for tree manipulation and visualization (Tutorial: 6.2). In addition, data can be exported to (and also imported from) the wordlist format proposed by the CLDF initiative (Forkel et al., 2017), which is intended to serve as a generic format for data sharing in cross-linguistic studies (Tutorial: 6.3).
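To give an impression of what such an export contains, the following sketch encodes cognate sets as a binary presence/absence matrix in Nexus syntax. It is a hand-written illustration with invented cognate set IDs and a hypothetical function name, not LingPy's export routine. It also simplifies in one important respect: a language that lacks a word for a concept is coded as 0 (absent) here, whereas a careful export would code such cases as missing data.

```python
def cognates_to_nexus(cognates):
    """Encode cognate sets as a binary presence/absence matrix in Nexus syntax.

    `cognates` maps each language to the set of cognate set IDs it attests.
    """
    languages = sorted(cognates)
    cogids = sorted(set().union(*cognates.values()))
    width = max(len(language) for language in languages)
    lines = [
        "#NEXUS",
        "",
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={len(languages)} NCHAR={len(cogids)};",
        'FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;',
        "MATRIX",
    ]
    for language in languages:
        # One character per cognate set: 1 if the language attests it, else 0.
        row = "".join("1" if cogid in cognates[language] else "0" for cogid in cogids)
        lines.append(f"{language.ljust(width)}  {row}")
    lines += [";", "END;"]
    return "\n".join(lines)


# Hypothetical cognate set IDs per language (invented for illustration).
nexus = cognates_to_nexus({
    "Hawaiian": {1, 3},
    "Samoan": {1, 2},
    "Tahitian": {1, 3},
})
```

Each column of the resulting matrix corresponds to one cognate set, so languages sharing many cognate sets receive similar rows, which is the signal that phylogenetic software works with.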

Concluding remarks
In this tutorial we have tried to show how automatic sequence comparison in LingPy can be carried out. Given the scope of this article, it is clear that we could not cover all aspects of alignments and cognate detection in full detail. We hope, however, that we could help readers understand what they should keep in mind if they want to carry out sequence comparison analyses on their own. Additional questions will be answered in the interactive tutorial accompanying this article. For deeper questions going beyond the pure application of sequence comparison algorithms, we recommend that readers turn to the extensive online documentation of the LingPy package (http://lingpy.org). Such questions include additional analyses (e.g. the minimal lateral network method for borrowing detection, List et al., 2014, or an algorithm for the detection of partial cognates, List et al., 2016b), routines for plotting and data visualization, and customization routines for user-defined sound-class models. We have emphasized multiple times throughout this article that the algorithms cannot and should not be used to replace trained linguists. Instead, they should be seen as a useful complement to the large arsenal of methods for historical language comparison, which can help experts to derive initial hypotheses on cognacy, speed up tedious annotation of cognate sets, and increase their efficiency and consistency.

Figure 1. Early alignment example for translational equivalents of 'nail' in aboriginal languages of California (based on Dixon and Kroeber, 1919), contrasted with a 'modern' representation using the EDICTOR tool (List, 2017).

Figure 2. Input format required by the LingPy package. The last two entries show how synonyms can be handled by placing different variants of one concept in one language variety into different rows with a separate ID each.

Figure 3. Basic parameters and concepts in pairwise alignment analyses: (A) Scoring function, (B) Sound classes, (C) Gap function and (D) Alignment mode.

Figure 4. Combining words for 'dog' in Samoan, Hawaiian, North Marquesan, and Anuta into a multiple alignment with the help of a guide tree.

Figure 5. Contrasting distances retrieved from three different alignment approaches for Polynesian words for 'neck'. Highlighted cells indicate that distances are smaller than the default threshold for the algorithms. The first column of each table indicates the cognate decisions resulting from the matrix and the threshold. How these cognate decisions are determined is further illustrated in the trees below each matrix. They show how a flat cluster algorithm which stops once a certain threshold is reached can be used to partition the words into cognate sets.

Figure 6. Some basic concepts important for automatic cognate detection.

Table 1. Recurring similarities in Sikaiana and Tahitian.

Table 2. SANE rules for data sanity.
[...] matters: Consistent phonetic transcription and segmentation are of crucial importance for automatic sequence comparison. Computers cannot guess whether multiple graphemes represent separate or single sound segments.

Table 3. Comparing different algorithms for cognate detection implemented in LingPy with respect to some fundamental parameters of sequence comparison.

Table 4. Comparing automatic cognate detection methods with expert cognate judgments for words for 'to turn' in East Polynesian languages.

Table 5. B-Cubed scores for different cognate detection algorithms compared against a test set of East Polynesian languages. Highlighted cells indicate the best scores for a given measure.