A digital corpus resource of authentic anonymized French text messages: 88milSMS—What about transcoding and linguistic annotation?

In 2011, six academics gathered over 90,000 authentic text messages (SMS) in French from the general public, in compliance with French law (http://sud4science.org, Panckhurst et al., 2013). The SMS ‘donors’ were also invited to fill out a sociolinguistic questionnaire (see Figure A1, Moïse, 2013, Panckhurst and Moïse, 2014). The ‘sud4science’ project is part of a vast international initiative, entitled ‘sms4science’ (http://www.sms4science.org/, Fairon et al., 2006, Cougnon and Fairon, 2014, Cougnon,

In this article, after briefly outlining the anonymization process, I focus on why we decided to exclude full 'transcoding' and linguistic annotation from the final processing of the first version of the 88milSMS corpus. Although anonymization is explained in depth elsewhere (Accorsi et al., 2014, Patel et al., 2013), I provide a summary of the problems involved here, as well as a list of the anonymization tags used for 88milSMS, since these are similar in structure to those provided for the linguistic annotation, which is described later.

Anonymization
Anonymization of private data is crucial and a legal requirement, which was closely monitored by the University's legal specialists. It took eight student internships and 21 months to accomplish the non-trivial three-step semi-automatic anonymization task, involving computational linguistic techniques.
A piece of software was especially devised by students to semi-automatically anonymize the first/last names, nicknames, (email) addresses, places, telephone numbers, codes, URLs, tradenames, etc., appearing in the SMS data, collected within the 'sud4science' framework (Accorsi et al., 2014, Patel et al., 2013). Of course, SMS writing is often very creative, rendering the anonymization process highly difficult: first names may (or may not) be capitalized (Cédric/cédric), characters may be repeated (Céééééédric), diminutive/abbreviated forms appear (Gégé for Gérard, JP for Jean-Pierre), etc.
The first automatic step meant that 72% of the corpus was anonymized (dictionary comparison allowed proper names such as Cédric to be anonymized, whereas words such as crayon ('pencil', in English) were discarded, since they belonged to one of the 'anti-dictionaries' used: LEFFF (Dictionary of inflected forms of the French language, Sagot, 2010); Dictionary of some SMS writing forms; Dictionary of place names).
88milSMS Digital Scholarship in the Humanities, Vol. 32, Supplement 1, 2017 i93
The second semi-automatic step required human expert intervention to discriminate between items requiring anonymization and those that remained unchanged (Pierre/pierre corresponds to 'Peter' or 'stone' in French, depending on the context). All words not contained in either the dictionary or one of the anti-dictionaries (automatic phase), or that had been highlighted as ambiguous candidates (semi-automatic phase), were considered 'unknown' and also highlighted (semi-automatic phase). This is summarized in Table 1.
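The lookup logic of the first two steps can be sketched roughly as follows. This is only an illustration, not the 'Seek&Hide' implementation: the word lists are tiny invented stand-ins for the real dictionaries, and the rule that a form found in both lists is 'ambiguous' is an assumption made for the sake of the example.

```python
import unicodedata
from enum import Enum


class Decision(Enum):
    ANONYMIZE = "anonymize"   # found only in the dictionary of names
    KEEP = "keep"             # found only in an anti-dictionary
    AMBIGUOUS = "ambiguous"   # found in both: left to a human expert (assumed rule)
    UNKNOWN = "unknown"       # found in neither: highlighted for review


# Toy word lists, invented for this illustration only.
NAMES = {"cédric", "pierre"}
ANTI_DICTIONARIES = {"crayon", "pierre", "fixe"}


def classify(token: str) -> Decision:
    """Classify one token by looking it up in both kinds of word list."""
    # Normalize accents to a single Unicode form and ignore capitalization,
    # since first names may or may not be capitalized in text messages.
    word = unicodedata.normalize("NFC", token).lower()
    in_names = word in NAMES
    in_anti = word in ANTI_DICTIONARIES
    if in_names and in_anti:
        return Decision.AMBIGUOUS
    if in_names:
        return Decision.ANONYMIZE
    if in_anti:
        return Decision.KEEP
    return Decision.UNKNOWN
```

With these toy lists, 'Cédric' would be anonymized, 'crayon' kept, 'Pierre' passed on to a human expert, and a creative form such as 'kiffe' flagged as unknown. Repeated characters and diminutives would need additional handling that this sketch leaves out.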
The third validation phase (conducted by student linguist interns) was important for confirming or modifying previous automatic decisions:
(n° 18307) grace a lui on comprend trop bien franchement ke kiffe la physique cette anne meme si cest bien dur [. . .]
thanks to him we really understand frankly I love physics this year even if it's really hard
Example 1: Validation phase.
(n° 81793) C bon tu peux m appeler sur mon fixe <TEL_10> <PRE_4>
It's ok you can call me on my landline (ten character telephone number, four characters in first name)
Example 2: Anonymized SMS.
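The tag format in Example 2 combines a category code with the character length of the hidden item. A minimal sketch of how such tags could be produced is given below; the function names are hypothetical, and the telephone number and first name in the usage lines are invented placeholders, not corpus data.

```python
from __future__ import annotations


def anonymization_tag(label: str, original: str) -> str:
    """Build a tag such as <PRE_4>, which keeps only the length of the hidden item."""
    return f"<{label}_{len(original)}>"


def anonymize(tokens: list[str], sensitive: dict[int, str]) -> str:
    """Replace the tokens at the given positions with their anonymization tags.

    `sensitive` maps a token position to a category label (e.g. "TEL", "PRE").
    """
    out = [
        anonymization_tag(sensitive[i], tok) if i in sensitive else tok
        for i, tok in enumerate(tokens)
    ]
    return " ".join(out)


# Invented placeholders stand in for the private data removed from the corpus.
tokens = "C bon tu peux m appeler sur mon fixe 0123456789 Marc".split()
print(anonymize(tokens, {9: "TEL", 10: "PRE"}))
```

Keeping the length in the tag preserves some information about the original form (for instance, whether a first name was abbreviated) without revealing the item itself.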
We have provided a sample of 1,000 'raw' text messages transcoded into standardized French and another sample of 100 linguistically annotated SMS. Why decide to exclude 'full' transcoding and annotation phases in the first version of the final corpus?

Transcoding
Transcoding 'raw' text messages into 'standardized' French means morpho-syntactic parsers and other natural language processing (NLP) tools can ultimately analyse them. Concerning the terminology, the 'sud4science' team deliberately chose to use 'transcoding', since it can be defined as converting from one form of coded representation to another. This makes it possible to distinguish oral speech (to written) 'transcription' techniques from written (to written) 'transcoding' ones, such as those applied to SMS data. From a linguistic point of view, one can also use the mainstream 'standardization', a synonym that we indeed used previously, along with 'normalization', which we prefer to use when dealing with computational linguistics matters. Here, I have maintained 'transcoding'.
Checking spelling and grammar facilitates comprehension, but 'no' supplementary information should be 'injected'.
'Raw' anonymized SMS (n° 22446): En fait c rien de spécial, jprends juste un peu de recul et jcomprends pas ce que jfous là, fac, psycho, montpellier, pourquoi simplement je vis, enfin bref rien de grave. Qu'est ce qui cloche chez toi?
Anonymized and transcoded SMS: En fait c'est rien de spécial, je prends juste un peu de recul et je comprends pas ce que je fous là, fac, psychologie, Montpellier,
In Example 3 above, the French negation 'ne' is not re-inserted (ce n'est rien, je ne comprends pas), since in oral forms, this is quite common and the negation 'pas' is sufficient for a parser. Prepositions/articles (« à la fac », « en psychologie », « à Montpellier ») are not 'reinjected' either, since automatic processing is possible without them. However, for abbreviated and agglutinated forms ('c' => 'c'est'; 'jprends' => 'je prends') transcoding into standardized French is necessary, so that a morpho-syntactic parser can automatically process the sentence.
The apocope 'fac' (instead of 'faculté', for University) has not been modified since the researchers decided to validate the transcoding in relation to the online French Petit Robert (PR, 2014) dictionary. If a lexical item appears therein, it is not transcoded in the corpus. Here, 'psycho' is transcoded into 'psychologie' because it does not appear as such in the dictionary. The PR includes certain popular forms, such as 'frérot' (brother), foreign words: 'week-end', acronyms: 'lol', French inverted forms ('verlan'): 'relou' (lourd/that's a pain), etc. These are not transcoded into standardized French. Typographical norms are also reinserted; in this example, a space before the question mark in French and a capital 'M' for the city of Montpellier.
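The dictionary-gated rule described above (leave a form alone if the reference dictionary lists it, otherwise expand it) can be sketched as follows. Both tables are toy stand-ins invented for illustration: the set only mimics a lookup in the Petit Robert, and the expansion table covers just the forms discussed in Example 3. Punctuation and typographical norms are deliberately ignored.

```python
# Toy stand-in for the reference dictionary: forms listed here are left untouched.
REFERENCE_DICTIONARY = {"fac", "lol", "relou", "week-end", "frérot"}

# Toy expansion table for abbreviated or agglutinated forms (illustrative only).
EXPANSIONS = {
    "c": "c'est",
    "jprends": "je prends",
    "jcomprends": "je comprends",
    "jfous": "je fous",
    "psycho": "psychologie",
}


def transcode_token(token: str) -> str:
    """Return the standardized form of one token, or the token itself."""
    key = token.lower()
    if key in REFERENCE_DICTIONARY:
        # Attested in the reference dictionary: no transcoding.
        return token
    # Expand known abbreviated/agglutinated forms; leave everything else as is.
    return EXPANSIONS.get(key, token)


def transcode(message: str) -> str:
    """Transcode a whitespace-tokenized message, one token at a time."""
    return " ".join(transcode_token(t) for t in message.split())


print(transcode("jprends juste un peu de recul"))
```

Note that nothing is 'injected' here: the negation 'ne' and missing prepositions are not restored, in line with the principle stated above.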
What if a texter tries to simulate a certain form of oral French, for instance, by using an apostrophe, or through agglutination ('j'sais' = 'je sais', 'chuis' = 'je suis') as shown above? Should these items be transcoded or not? What about punctuation, often absent in text messages? Should one re-introduce this systematically? Example 3 shows how difficult the transcoding process can be.
Researchers may well have differing theoretical viewpoints on these matters. In November 2011, the Montpellier team invited researchers involved in previous sms4science data collections to a two-day workshop to exchange views on harmonization/standardization techniques related to anonymization, transcoding, and annotation for processing SMS written data. Over and above compulsory anonymization, some teams had either partially or entirely transcoded their SMS 'raw' data into standardized French and conducted linguistic annotation. Others had not. It is extremely difficult to agree on standardized ways to proceed, owing to varying theoretical views, or (pluri)disciplinary positions.
For instance, in one of our seminars, two psychologists, Goumi and Bernicot (2011), presented some of their transcoded data. One of the 'raw' SMS examples they provided was as follows: 'Lèa t c se kil i a fair en techno'. This more or less translates to the following: 'Léa, do you know what we have to do in technology?' The example was transcoded, following their specifications, so as to maintain 'oral forms' ('Léa t'sais ce qu'il y a à faire en techno') and 'a formal academic normed transcription' was then provided ('Léa sais-tu ce qu'il y a à faire en technologie?'). In this case, they chose to radically transform the original SMS, with, among other aspects, questions with subject pronoun + verb inverted forms ('t c'/'t'sais'/'sais-tu'), contractions or apocopes ('techno'/'technologie'), phonetic variations, ellipsis, etc. ('kil i a fair'/'qu'il y a à faire'). These transcodings may suffice for psychologists, but they would most certainly cause debate among linguists, who would be inclined to have differing views on acceptable transcodings, from oral/written/computational linguistics perspectives. I actually set up a transcoding exercise with my colleagues to check these differences.
I chose a sample of 1,000 text messages and submitted it to them: two computer scientists involved with NLP, one computational linguist (CL), two discourse analysis linguists, and one sociolinguist. The conclusion was radical: we had transcoded the extract depending on our discipline areas. For those involved in NLP and CL, it was important to take into account the fact that the sample could be processed by a machine, therefore 't' from the above example would need to be transcoded into 'tu', whereas for a linguist who is used to working with oral transcriptions, this is unjustified and perceived as 'injecting' an interpretation which is initially absent. The list goes on and on. Even though manual transcoding is not a viable option for standardization of subsequent versions of the 88milSMS corpus, normalization using automated NLP techniques has been researched by our team (see Section 5, Lopez et al., 2014).
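One simple way to locate where two transcodings of the same message part ways is a token-level comparison. The sketch below is not the procedure used in our exercise; it is an illustration built on Python's standard difflib module, applied to the two transcodings by Goumi and Bernicot quoted above. The function name is hypothetical.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def divergences(first: str, second: str) -> list[tuple[str, str]]:
    """List the token spans on which two transcodings of one message differ."""
    a, b = first.split(), second.split()
    # autojunk=False keeps every token eligible for matching.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            spans.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return spans


oral = "Léa t'sais ce qu'il y a à faire en techno"
formal = "Léa sais-tu ce qu'il y a à faire en technologie?"
for left, right in divergences(oral, formal):
    print(f"{left!r} vs {right!r}")
```

On this pair, the comparison isolates the pronoun/verb inversion and the apocope, which are precisely the points on which transcoders from different disciplines are likely to disagree.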

Annotation
Another issue is linguistic annotation of the corpus (Ide and Pustejovsky, forthcoming). For example, the 'raw' SMS 'je met tout ça de coté et peux tout encaisser juste pour toi.' (I'm leaving all of that aside and I can bear it all just for you.) could be transcoded into standardized French as follows: 'Je mets tout ça de côté et je peux tout encaisser juste pour toi.' It could then be linguistically annotated with information of interest to researchers, among other items: spelling, grammatical information, emoji insertion, code-switching, typography, missing accents, voluntary modification, etc. Therefore, I define linguistic annotation of SMS data for the 88milSMS corpus as 'interpretative' linguistic information indicated via appropriate tags (see below), related to the difference between a 'raw' text message and its transcoded equivalent in standardized French. I do not include lemmatization or part-of-speech (POS) tagging in this definition (see Section 5), although these do indeed correspond to other methods of linguistic annotation (based mainly on providing lexico-morpho-syntactic information).
Examples of these tags appear in Table 2 (note that only one type of tag appears per SMS to facilitate reading).
In n° 7063, <TYP> indicates a missing space before punctuation (necessary in French). In n° 43927, <LAN> refers to a word which is borrowed from English ('fight'). The emoticon in n° 6887 is easy to recognize. Annotation involving double (or more) tags may also be necessary in some situations:
n° 5409, T'y vas à quelle heure? Nous on y est dans 10 minutes <EMO_TYP_missing space>^^
What time are you going there? We'll be there in ten minutes^^
n° 43818, Oww emm gee <MOD_LAN> neighb !! La saison 3 de vampire diaries est juste incroyable!
OMG neighbour!! Season 3 of Vampire Diaries is just incredible!
n° 49721, C est pas TOI le pb le pb <TYP_ORT>c edt le groupe!
It's not YOU the pb the pb was the group!
Example 5: Unambiguous double tags.
The emoticon in n° 5409 has a missing space before it; thus, <TYP> is also a necessary tag. In n° 43818, 'neighbour', which appears in English <LAN>, has been shortened to 'neighb', thus justifying the <MOD> tag. In n° 49721, 'c edt' (c'est) has a missing apostrophe <TYP> and a typing mistake <ORT>.
In other situations, however, it might be difficult to decide which tag(s) to choose:
n° 49808, <MOD?> <ORT?>bone journè
Have a nice day
n° 11682, Il <GRA?> <MOD?>es rentrer a 22h30 et jai eu ldroii au : jsui fatiguer, jai mal a la tete jvai me coucher.
He came home at 10.30pm and I got to hear: I'm tired, I have a headache, I'm going to bed
Example 6: Tag choice.
In n° 49808, the 'scriptor' may have voluntarily modified the two words ('Bonne journée') or may have lacked spelling knowledge. So should <MOD> and/or <ORT> be used? In n° 11682, 'rentrer' ('Il est rentré') could be either a grammatical mistake <GRA> or the scriptor may have preferred using an 'r' <MOD> instead of pressing the 'e' to access the acute accent (on a smartphone).
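For readers who wish to process such tags automatically, the sketch below extracts compound tags (such as <TYP_ORT>) and tentative tags (such as <MOD?>) from an annotated message. It assumes, purely for illustration, that every label is a three-letter uppercase code and that a trailing question mark marks an uncertain choice; the names used are hypothetical and do not come from the project's tooling.

```python
from __future__ import annotations

import re
from typing import NamedTuple


class Label(NamedTuple):
    name: str        # three-letter code, e.g. "TYP"
    uncertain: bool  # True when the annotator marked the label with "?"


# One or more three-letter codes joined by underscores, each optionally
# followed by "?", with optional spaces inside the angle brackets.
TAG = re.compile(r"<\s*([A-Z]{3}\??(?:_[A-Z]{3}\??)*)\s*>")


def parse_tags(text: str) -> list[list[Label]]:
    """Return one list of labels per tag found in the annotated message."""
    result = []
    for match in TAG.finditer(text):
        parts = match.group(1).split("_")
        result.append([Label(p.rstrip("?"), p.endswith("?")) for p in parts])
    return result


print(parse_tags("C est pas TOI le pb le pb <TYP_ORT>c edt le groupe!"))
print(parse_tags("<MOD?> <ORT?>bone journè"))
```

Anonymization tags such as <TEL_10> are not matched by this pattern, since their suffix is a number rather than a label, and tags carrying a free-text note would need a richer pattern.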
Sometimes, researchers may well disagree with the choice of tags. In Example 7, below, should one indicate that a subject pronoun is 'missing'? The 'absence' or 'ellipsis' notion may not be relevant for certain researchers. For instance, for a CL, in Example 7, the subject pronoun 'je' (I) is missing, and may be categorized as an 'ellipsis'. For other linguists, for instance, those working on oral forms, the ellipsis/absence idea is irrelevant because one should merely accept the example as it was spoken/written in the first place; from this point of view, nothing is 'missing', as such. Punctuation and typography are also an important issue. To what extent should they be 'reintroduced' if absent? This is a highly frequent situation in text messages.

Conclusion
We decided to limit the processing to two extracts. Our (rare) choice to exclude full transcoding and tagging is a theoretical position: linguistic annotation of SMS data (as we have defined it, cf. Section 4) is far from neutral. It is directly linked to an interpretative framework. A true consensus on how to standardize the transcoding and linguistic annotation does not exist, owing to differing/varying theoretical, (pluri)disciplinary, and scientific stances. McEnery and Hardie (2012) comment on the two sides of the coin, weighing up the pros and cons of corpus annotation:
Arguments against annotation are largely predicated upon the purity of the corpus texts themselves, with the analyses being viewed as a form of impurity. This is because they impose an analysis on the users of the data, but also because the annotations themselves may be inaccurate or inconsistent [. . .]. Such claims are interesting because, as has been noted, corpus annotation is the manifestation within the sphere of corpus linguistics of processes of analysis that are common in most areas of linguistics. To identify problems with accuracy and consistency, in corpus annotation is, in principle at least, to identify flaws with analytical procedures across the whole of linguistics. It is because of the issues of accuracy and consistency, in particular, that some linguists prefer to use unannotated corpora. But this does not mean to say that such linguists do not analyse the data they use; rather, it means that they leave no systematic record of either their analysis or their errors which can easily and readily be tied back to the corpus data itself. (McEnery and Hardie, 2012, p. 14)
We believe that mark-up initiatives should not be imposed upon researchers; it seems more relevant to let them conduct their own annotation bearing their specific scientific questioning in mind, without being trapped within a unique theoretical framework.
Another alternative is that researchers may of course prefer to provide both 'raw' and tagged corpora: 'Dissemination will take two different forms: one version of a corpus with the "raw" text without any tokenization and annotation (v1), and a second version of the same corpus with the annotations (v2).' (Chanier et al., 2014, p. 2). For instance, Riou and Sagot (2016) present morpho-syntactic tagging of a specific corpus within the French CoMeRe corpora repository (v2), following on from a previous version without it (v1).
The 88milSMS digital corpus resource will provide inspiration for many years to come. Our corpus can be used to analyse contemporary mediated electronic discourse, from a (pluri)disciplinary perspective (linguists, communications specialists, psychologists, sociologists, computer data specialists, etc.), build knowledge on SMS writing forms (Panckhurst 2009, Roche et al., forthcoming), and let algorithms learn from this: alignment methods for facilitating automatic transcoding/standardization/normalization are currently being explored, following Aw et al., 2006, Beaufort et al., 2008, Guimier de Neef and Fessard, 2007, Kobus et al., 2008, as are methods for classifying 'unknown' items for use in automatically identifying lexical 'creativity' within 88milSMS and also to improve electronic dictionary approaches. If normalization techniques can be truly implemented for processing 88milSMS, then lemmatization and POS-tagging may also be envisaged, since the latter currently include a high error ratio (if tools are used on 'raw' text messages). In Lopez et al. (2016) we specified the following as our next step:
In order to refine automatic normalisation techniques for initially non-standard texts in French, the next logical step is to compare our resource with different types of instant media (i.e. SMS, forums, tweets). Firstly, a new typology of the detected 'mistakes', based on existing typologies, will be elaborated. Secondly, automatic normalisation techniques, focussing on the most frequent errors, will be proposed. These will then be confronted with traditional automatic translation (Vilariño et al., 2012), speech recognition (Kobus et al., 2008) and spelling/grammatical checker principles (Beaufort et al., 2010). Finally, the approach should enable comparison between different types of instant media.
The resource also sheds light on 'corpus-driven' and 'corpus-based' approaches (Panckhurst et al., forthcoming). We produced and submitted an XML encoding of 88milSMS, within the Dariah initiative in 2015 (Digital Research Infrastructure for the Arts and Humanities: Dariah-fr, http://www.dariah.fr/). A 2016 version of 88milSMS, which was produced in accordance with XML-TEI guidelines and allows more widespread access thanks to a CC BY 4.0 licence on the Ortolang platform (https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1; Panckhurst et al., in Chanier (ed), 2016), is another major step forward. This is indeed a further form of (shareable) annotation, which could be of use to the community. Thierry Chanier conducted an XML-TEI transfer for this v2 version of 88milSMS, including additional encoded metadata with detailed information on the project, the corpus, and the questionnaire.
I also hope, thanks to the two most recent XML-TEI initiatives, that the resource will be eligible for long-term archiving with the CINES (Centre Informatique National de l'Enseignement Supérieur, https://www.cines.fr/). This would mean that in the future, people could look back and explore these 'snapshot' resources and understand more about the evolution of scriptural practices and usages in the 21st century.

Acknowledgements
I would like to thank two anonymous reviewers for their valuable and thought-provoking remarks. Any remaining mistakes are of course my own.
This work was supported by the MSH-M (Maison des Sciences de l'Homme de Montpellier, France, http://www.msh-m.fr/), the DGLFLF (Délégation générale à la langue française et aux langues de France, http://www.dglflf.culture.gouv.fr/), and the CNRS (PEPS ECOMESS, HuMaIn). The SMS data described in this article was collected within the framework of the sud4science LR (http://www.sud4science.org) project. It is part of a vast international SMS data collection project, entitled sms4science (http://www.sms4science.org), and was initiated at the CENTAL (Centre for Natural Language Processing, Université Catholique de Louvain, Belgium) in 2004. In particular, we thank Cédrick Fairon, Louise-Amélie Cougnon, and Hubert Naets (CENTAL), for their support, during our project. Many thanks to my colleagues, Catherine Détrie, Cédric Lopez, Claudine Moïse, Mathieu Roche, Bertrand Verine. The SMS project, Sud4science LR, would never have taken place had my colleagues decided not to join me in the adventure. We are very grateful to our 'Informatique et Libertés' (data protection legislation) legal advisor, Nicolas Hvoinsky, and his director, Stéphanie Delaunay (DAJI, Université Paul-Valéry Montpellier 3), who accompanied and legally advised our team throughout the project. We thank our student interns: Anthony Stifani (Master's student in Information and Communication, Université Paul-Valéry Montpellier 3), who manually analysed many of our text messages, thus allowing evaluation of the anonymization system; Pierre Accorsi and Namrata Patel (Master's students in Computer Science at the Université de Montpellier), who developed the 'Seek&Hide' software, used to anonymize the corpus; Michel Otell, Camille Lagarde-Belleville, Frédéric André, and Yosra Ghliss (Master's students in Language Sciences, Université Paul-Valéry Montpellier 3) who performed the online manual anonymization with 'Seek&Hide' and verified the automatic anonymization of the corpus; Aghiles