Abstract

This study explores the feasibility of cross-linguistic authorship attribution and author gender identification using Machine Translation (MT). Computational stylistics experiments were conducted on a Greek blog corpus translated into English using Google’s Neural MT. A Random Forest algorithm was employed for authorship and gender profiling, using different feature groups [Author’s Multilevel N-gram Profiles, quantitative linguistics (QL), and cross-lingual word embeddings (CLWE)] in both original and translated texts. Results indicate that MT is a viable method for converting a multilingual corpus into one language for authorship attribution and gender profiling research, with considerable accuracy when the training and testing datasets are in the same language. In the pure cross-linguistic scenario, higher accuracies than the baselines were obtained using CLWE and QL features.

1. Introduction

Computational Stylistics has grown rapidly in recent years, with authorship attribution and author profiling being among its most common applications. By using vectors of measurements that capture stylistic variation and training Machine Learning algorithms on large corpora, we can reliably predict authors’ identity and traits such as gender, age, and personality. Supervised classification methods exhibit excellent performance in closed-class problems (e.g. David et al. 2016; Posadas-Durán et al. 2017). Advanced document representations, such as word embeddings, have enhanced stylometric approaches with semantic information and higher-level language structure patterns (e.g. Franco-Salvador et al. 2017). These features, paired with state-of-the-art Deep Learning approaches (e.g. Veenhoven et al. 2018), have made Computational Stylistics a promising and evolving research area in Natural Language Processing (NLP).

However, challenges remain: the need for a substantial amount of textual data from each author to develop reliable quantitative profiles of their idiolect (Luyckx and Daelemans 2008; Zheng and Jin 2022), the largely unsolved open-class authorship problems, and the performance drop caused by mismatches between training and testing datasets in cross-topic and cross-genre authorship attribution.

The most challenging scenario is cross-linguistic authorship attribution and author profiling, where texts are written in a non-native language of the author. This problem challenges the central idea of stylometry: that each author has a unique, identifiable way of using language (the “stylome” hypothesis). In this article, we assess computational stylistic approaches on texts translated into another language using Machine Translation (MT), which introduces additional bias, as the stylometric signal must survive both language conversion and natural language generation.

The remainder of the article is organized as follows: In Section 2, an overview of the cross-linguistic research on author profiling and authorship attribution will be provided. The fundamental methodological decisions we made for this study—including the corpus we created, the features we employed, and the algorithm we tested—are detailed in Section 3. Next, Section 4 evaluates and reports on the outcomes of our experiments. Finally, our key conclusions are outlined in Section 5, including future work.

2. Previous work

2.1 Author profiling

Author profiling, the task of analyzing text to uncover various traits of its author, is based on the theoretical notion of sociolect and the “stylome hypothesis.” Sociolect involves both passive and active engagement with communicative practices, reflecting socio-psychological traits that shape language production and, conversely, can be identified by observing the associated quantitative linguistic patterns. The stylome hypothesis dates back to the mid-2000s, when van Halteren et al. (2005) compared it to the genome, suggesting that an author’s stylometric profile is unique.

Many studies in author profiling have explored various demographic characteristics such as gender (Aravantinou et al. 2015; Veenhoven et al. 2018), age (Rangel and Rosso 2013; Lundeqvist and Svensson 2017), personality (Smynor 2015; McCollister 2016), political affiliation (Tumasjan et al. 2010; David et al. 2016), and even sexual orientation (Loh, Soo, and Xing 2016). In this study, we are focusing on deducing the author’s gender, as systematic differences in linguistic production between men and women have long been observed in nearly all aspects of linguistic levels (Mikros 2013a).

Gender identification has been the focus of many studies since the early 2000s, with Koppel, Argamon, and Shimoni (2002) pioneering the use of stylometric features for this purpose. Their analysis using the British National Corpus revealed gender-specific usage of text features, indicating the interaction between the author’s gender and the text genre. Subsequent studies have employed various document representations and machine learning algorithms for automatic gender identification. Document representations include Bag-of-Words (Bamman, Eisenstein, and Schnoebelen 2014), n-gram models (Potamianos and Jelinek 1998; Mikros and Perifanos 2015), stylometric features (Mikros 2020), and word embeddings (Franco-Salvador et al. 2017), whereas machine learning algorithms range from Logistic Regression (Nowson 2006) and Support Vector Machines (David et al. 2016; Stamatatos et al. 2018), to Random Forest (RF; Xu and Jelinek 2004) and Deep Learning models such as long short-term memory (LSTM; Veenhoven et al. 2018; Alroobaea et al. 2020) and convolutional neural networks (CNNs; Schaetti 2017). These experiments have been effective across various text genres, including social media and blogs, demonstrating significant differences in linguistic usage between male and female authors.

Recent studies (Lee 2019; Mikros 2020) have extended author profiling to translated texts, exploring whether they conform to standard stylometric patterns or exhibit specific peculiarities that constitute a different variety, a separate “dialect” within a language commonly referred to as “third code” (Frawley 1985) or “translationese” (Gellerstam 1986). Baroni and Bernardini (2005) and Bernardini and Baroni (2005) have shown that although a general “translationese” is identified, strong influences from the source language have also been noted. Most research on translationese and translation universals has been conducted using specific linguistic units of analysis to operationalize concepts such as interference, simplification, and normalization. Since “translationese” is a code that incorporates extra-linguistic information, we believe it is reasonable to extend the scope of the research question and form a research hypothesis about how the author’s gender is reflected in the (machine-)translated text. Translations are, after all, language products and are expected to acquire the specific linguistic features typical of a given author (idiolect) and particular social variables describing the text’s author (sociolect). It is important to note the distinction between translator style, which refers to the unique style of an individual translator, and translation style, which emerges as a result of the translation process itself (Saldanha 2011). The focus of this study is on the latter, investigating how the translation process influences the reflection of the author’s gender in the translated text.

2.2 Authorship attribution

Authorship attribution, on the other hand, employs computational techniques to determine a text’s author based on their unique writing style (Neal et al. 2017; Sari 2018). There are three primary forms of authorship attribution: author verification, closed-set and open-set attribution (Barlas and Stamatatos 2020). In author verification, a single author is considered, whereas closed-set and open-set involve multiple candidate authors, with the latter allowing for an unknown author. Posadas-Durán et al. (2017) achieved over 98 per cent accuracy in binary author verification using word n-grams and Doc2vec. Other methods range from analyzing lexical, syntactical, and grammatical features (Argamon and Levitan 2005; van Halteren et al. 2005; Zhao and Zobel 2005; Grieve 2007; Wu, Zhang, and Wu 2021; Zheng and Jin 2022) to only focusing on function or content words (Kestemont 2014; Boumparis, n.d.).

Deep Learning, particularly CNNs, has shown promising results in authorship attribution (Zhang, Zhao, and LeCun 2015; Ruder, Ghaffari, and Breslin 2016; Shrestha et al. 2017). Character-level features often outperform word sequences, a finding also supported by Shrestha et al. (2017). Pretrained language models like BERT, enhanced with additional dense layers and softmax activation, have achieved accuracy levels up to 5 per cent above other approaches (Fabien et al. 2020). According to Zheng and Jin (2022), the effectiveness of Deep Learning is attributed to its autonomous feature engineering and hierarchical, complex network structure. However, in open-set attribution, where the author is not in the candidate list, linear classifiers achieve near-perfect accuracy but require more robust approaches for larger candidate pools (Badirli et al. 2021).

Bogdanova and Lazaridou (2014) introduced the task of Cross-Language Authorship Attribution. They investigated author identification in texts translated into languages different from the one in the labeled data. They reported 95 per cent and 93 per cent accuracy for English and Spanish, respectively, using a linear SVM with Bag-of-Words. MT reduced the accuracy, although high-level features paired with k-Nearest Neighbors scored promising results of up to 95 per cent.

One admittedly decisive factor in authorship attribution tasks is the data size per candidate author, with many studies measuring the impact of corpus size (Jin and Murakami 2007; Luyckx and Daelemans 2008; Zheng and Jin 2022). Argamon and Levitan (2005) predicted the author’s identity and nationality in novels with an average of 10,000 words per book, with 99 per cent and 93.5 per cent accuracy, respectively. For shorter texts like Tweets, RF generally yields the best results (Tanaka and Jin 2014; Rao, Raju, and Kumar 2017), guiding our choice of methodology in the series of experiments discussed in the following section.

3. Research methodology

3.1 Research aims

Our study focuses on the impact of MT on authorship attribution and gender identification tasks. In particular, we want to investigate whether a machine-translated text preserves its stylometric profile and whether author profiling methods can be applied reliably to machine-translated texts. The motivation behind this research aim is to validate cross-linguistic stylometric analysis using MT as a potential method to overcome specific language barriers and work with texts in one language regardless of their source language. If the target language of this methodology is English, then we will be able to leverage the vast array of linguistic resources and tools available for English and apply state-of-the-art NLP methods that are yet to be utilized in other languages. To explore this research aim, we set two more specific research objectives:

  • Compare different feature groups (Author’s Multilevel N-gram Profiles—AMNP, lexical diversity features—quantitative linguistics (QL), cross-lingual word embeddings—CLWE) in the authorship attribution and the gender identification task and investigate their relative accuracy in both the source and the target language datasets.

  • Explore whether cross-linguistic authorship attribution and authors’ gender profiling are feasible when the training dataset uses the source language and the testing dataset is the translated language.

3.2 Corpus

To explore the above research objectives, we have conducted a small-scale experiment focused on texts written in Modern Greek and translated into English using Google’s Neural MT (NMT). We first compiled a Greek blog corpus to be used as our source language dataset. To this end, we harvested the Greek blogosphere and manually collected 100 Greek blogs equally divided into fifty male and fifty female bloggers. The bloggers’ gender was determined based on their own descriptions on their profile pages and short bios.

In this study, we opted to use Google’s NMT due to its widespread availability, robust performance across a wide range of languages, and its neural architecture, which has been shown to produce more fluent and accurate translations compared to statistical MT systems (Wu et al., 2016). However, we acknowledge that other MT tools, such as DeepL or EU’s e-Translation, may yield different results, and future research could explore the impact of various MT systems on cross-linguistic authorship attribution and gender profiling tasks.

The compiled corpus was translated into English using the Google Translate API. Due to the API’s 5,000-character quota per request, we wrote a Python script that splits texts longer than this threshold into chunks before sending each request and retrieving the translated text as a response. To minimize mistranslations, chunks were cut only at sentence boundaries: we split at the last sentence-final punctuation mark such that adding the sentence would not push the running character count past the threshold. Once the API returned the machine-translated chunks, the script reconnected them into a single target file.
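
The splitting logic can be sketched as follows (a hypothetical reconstruction, not the original script; the sentence-boundary pattern and function names are assumptions, and in practice Greek sentence-final punctuation such as the Greek question mark “;” would need to be covered by the pattern):

```python
import re

MAX_CHARS = 5000  # Google Translate API per-request character quota


def split_into_chunks(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars characters,
    cutting only at sentence-final punctuation so that no sentence
    is broken across two API requests."""
    # Keep each delimiter with its sentence; this pattern is a
    # simplification of what a production script would use.
    sentences = re.split(r'(?<=[.!?;])\s+', text)
    chunks, current = [], ''
    for sentence in sentences:
        candidate = (current + ' ' + sentence).strip()
        if len(candidate) <= max_chars:
            current = candidate          # sentence still fits: extend chunk
        else:
            if current:
                chunks.append(current)   # flush the filled chunk
            current = sentence           # start a new chunk with this sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent to the API separately, and the translated responses concatenated back into a single target file.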

The descriptive statistics for both the source and the target language corpora can be found in Table 1.

Table 1.

Descriptive statistics for the source (Greek, EL) and target (English, EN) language corpus.

                     Source (EL)                       Target (EN)
           N words     SD   min      max     N words     SD   min      max
Female     899,099    295    –   104,653     918,189    315    –   105,167
Male       873,628    396    –   107,921     907,862    410    –   107,749
Total    1,772,727    349    –   107,921   1,826,051    366    –   107,749

The descriptive statistics reported for both corpora show that the translation process did not substantially alter the quantitative profile of the texts written by males and females. The gender-based differences are preserved across all metrics (text size [N words], standard deviation of the text size [SD], minimum text size [min], and maximum text size [max]).

3.3 Feature group comparison

We studied the impact of several features commonly used in authorship profiling tasks to get a better understanding of which dimensions of the stylometric profile are affected during the MT process. We grouped the features into three distinct categories:

3.3.1 AMNP

AMNP, introduced by Mikros and Perifanos (2013), is a highly effective document representation for authorship attribution and author profiling tasks (Mikros 2013a, 2013b; Mikros and Perifanos 2015; Mikros 2018). It is based on increasing n-gram sizes of character and word units, ensuring coverage of different linguistic levels. The theoretical motivation behind AMNP comes from the Prague School of Linguistics and the concept of “double articulation” (Nöth 1995: 238), which states that language is divided into two layers: meaningful units (morphemes) and minimal functional units (phonemes). The combination of these layers produces grammatically correct linguistic production. Similarly, to capture the multi-level manifestation of stylistic traits, features present at various linguistic levels must be detected and combined to accurately represent an author’s style.

Here, we extracted the 1,000 most frequent character and word n-grams with n = 2 and 3, resulting in a total vector of 4,000 features. The resulting vector is the AMNP, a document representation that simultaneously captures both character and word sequences. We used each feature’s normalized frequency to avoid introducing text-length bias into subsequent calculations.
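
A minimal sketch of this extraction (hypothetical code, not the authors’ implementation; whitespace tokenization is an assumption):

```python
from collections import Counter

# (unit, n) configurations: character and word bigrams and trigrams
CONFIGS = [('char', 2), ('char', 3), ('word', 2), ('word', 3)]


def ngram_counts(text, unit, n):
    """Count the n-grams of one text for a given unit and order."""
    seq = text if unit == 'char' else text.split()
    grams = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    return Counter(grams), max(len(grams), 1)


def amnp(texts, top_k=1000):
    """AMNP: per configuration, keep the top_k most frequent n-grams
    in the corpus; represent each text by the normalized frequency of
    every retained n-gram (up to 4 * top_k features in total)."""
    vocab = []
    for unit, n in CONFIGS:
        corpus = Counter()
        for t in texts:
            counts, _ = ngram_counts(t, unit, n)
            corpus.update(counts)
        vocab.append([g for g, _ in corpus.most_common(top_k)])
    vectors = []
    for t in texts:
        row = []
        for (unit, n), grams in zip(CONFIGS, vocab):
            counts, total = ngram_counts(t, unit, n)
            # normalized frequency prevents text-length bias
            row.extend(counts[g] / total for g in grams)
        vectors.append(row)
    return vocab, vectors
```

With top_k = 1000 this yields the 4,000-dimensional AMNP vector described above.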

3.3.2 Lexical diversity (QL)

Lexical diversity indices, inspired by QL, recognize the lexicon as the basis of human core language ability (Long and Richards 2007) and the main source of language variation (Bates and Goodman 1997 in Treffers-Daller 2011: 149). These measures capture the vocabulary size using mathematical approaches applied to types or lexemes in a text (Read 2000: 200).

Being complex constructs, they approach deeper aspects of language production, with lexical knowledge as the most important component of variability in language ability (Treffers-Daller 2011: 149). They have been used to diagnose cognitive disorders like Alzheimer’s (van Velzen, Nanetti, and de Deyn 2014), Parkinson’s (Ellis, Holt, and West 2015), SLI (Owen and Leonard 2002), and aphasia (Fergadiotis, Wright, and West 2013), indicating their correlation with cognitive aspects of language production.

Lexical richness indices have also been used to assess bilingual speakers’ language production (Daller, van Hout, and Treffers‐Daller 2003; Treffers-Daller 2011; Treffers-Daller and Korybski 2015), playing an important role in detecting language dominance and code-switching. We included a list of features calculated using specialized software for quantitative linguistic indices (Kubát, Matlach, and Čech 2014) and custom scripts:

  • h-point: This index represents the “bisector” point of the rank ∼ frequency distribution at which rank = frequency. Jorge E. Hirsch (2005) initially proposed it for scientometrics, and Popescu (2007) introduced it into linguistics and developed it further (Popescu, Best, and Altmann 2007). Using the definition mentioned above, the h-point splits the vocabulary into two essential parts, namely into a class of magnitude h of frequent function words (synsemantics) and a much larger class of content words (autosemantics) with size Vh which are not so frequent but constitute the better part of the text’s vocabulary (Popescu et al. 2009: 19).

  • Entropy (H): The term has been used across many scientific disciplines with different meanings, mainly to quantitatively define the diversity or the uncertainty of a system. In this article, entropy is calculated using Shannon’s formula on the word frequencies of the corpus (Oakes 1998: 59). As a result, texts with extensive vocabularies and low frequencies produce high entropy, while texts with controlled vocabularies and formulaic or systematic word usage exhibit lower entropy.

  • Yule’s characteristic K: A measure of vocabulary “richness” based on the work of Yule (1944). The index measures the lexical Repeat Rate (RR) and has been found to be sufficiently robust regardless of the text size from which it is calculated (Tweedie and Baayen 1998).

  • Writer’s view: An index proposed by Popescu and Altmann (2007) that is connected to the golden ratio (φ ≈ 1.618). It is defined as the angle that is formed between the word frequency ∼ rank distribution end and its top, as seen from the h-point (Popescu et al. 2009: 26). Commenting on its name, they claim that it is “baptized in this way because one can imagine the writer ‘sitting’ at this point and controlling the equilibrium between autosemantics and synsemantics.”

  • R1: An index of vocabulary richness proposed by Popescu et al. (2009: 29–34), which is based on the h-point and the cumulative relative frequencies up to the h-point.

  • RR: The RR shows a text’s degree of vocabulary concentration. In other words, this indicator measures vocabulary richness inversely: the higher the RR, the less vocabulary diverse the text. The resulting values of RR are in the interval <1/V; 1> where V is the number of lexical types in the text (Kubát, Matlach, and Čech 2014).

  • Relative RR of McIntosh (RRmc): A normalized version of RR, originally proposed by McIntosh (1967), which takes values in the closed interval [0, 1]; unlike RRmc, the RR of Kubát, Matlach, and Čech (2014) cannot take the value 0.

  • Curve Length (L): A vocabulary richness index based on the rank ∼ frequency distribution curve, defined as the sum of Euclidean distances between all points on the curve (Kubát, Matlach, and Čech 2014).

  • Curve length R Index (R): A vocabulary richness index derived from the curve length (L). It is defined as the ratio of the curve length above the h-point to the whole curve length (Kubát, Matlach, and Čech 2014).

  • Adjusted Modulus (A): A frequency structure indicator that is supposed to be independent of text length (Popescu et al. 2010).

  • Gini Coefficient (G): A measure of statistical dispersion originally developed for econometric analysis based on the Lorenz curve. It can be used to measure vocabulary richness by considering the rank ∼ frequency distribution by reversing the rank order (Popescu et al. 2009).
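
A few of these indices can be sketched directly from a text’s word-frequency distribution (a simplified illustration, not the QUITA implementation used in the study; the h-point here is the discrete approximation, without the interpolation applied when no rank exactly equals its frequency):

```python
import math
from collections import Counter


def ql_indices(words):
    """Compute a handful of the indices above from the rank-frequency
    distribution of a tokenized text."""
    freqs = sorted(Counter(words).values(), reverse=True)
    N = len(words)
    # h-point: the rank r at which rank = frequency (first r with f <= r)
    h_point = next((r for r, f in enumerate(freqs, 1) if f <= r), len(freqs))
    # Shannon entropy H over relative word frequencies
    H = -sum((f / N) * math.log2(f / N) for f in freqs)
    # Repeat rate RR: vocabulary concentration (inverse of diversity)
    RR = sum((f / N) ** 2 for f in freqs)
    # Yule's K: a scaled repeat rate, robust to text length
    K = 10_000 * (sum(f * f for f in freqs) - N) / (N * N)
    return {'h_point': h_point, 'entropy': H, 'RR': RR, 'yule_K': K}
```

The remaining indices (writer’s view, R1, L, R, A, G) follow analogously from the same rank-frequency distribution.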

3.3.3 CLWE

Word embeddings are a popular representation of a text’s vocabulary. Words or phrases from the lexicon are translated into vectors of real numbers and can subsequently be used in language modeling and feature learning. These numerical vectors can capture text representations in an n-dimensional space, where words with the same meaning are represented similarly. This indicates that two comparable words are located close to each other in vector space and have similar vector representations. Therefore, the objective of creating a word embedding space is to record some form of relationship in that space, be it a relationship based on meaning, morphology, context, or another type of link. Since language produced by different genders exhibits systematic differences across all levels of its linguistic organization (Mikros 2013b), we foresee that word embeddings can be used effectively for the gender identification task. In fact, several studies have reported satisfactory results in various author profiling tasks (Bayot and Gonçalves 2016; Bojanowski et al. 2017; Dias and Paraboni 2018; López-Santillán et al. 2020).

Since, in our study, we are using texts in their source language (Greek) and their translations in a target language (English), we decided to explore how CLWE could be used as a language-neutral approach to the features used for training the classification algorithm both in the authorship attribution and the author profiling tasks. CLWE are a type of word representation that extends the concept of word embeddings to multiple languages. They represent translation-equivalent words from two or more languages close to each other in a cross-lingual space. This allows for comparing and transferring semantic information across different languages.

We created CLWE using monolingual mapping, a method to create a shared semantic space for words from different languages using pre-trained monolingual word embeddings on large corpora within each language independently. To obtain the pretrained word embeddings, we utilized fastText1 (Bojanowski et al. 2017) for both Greek and English. FastText was chosen for its ability to capture subword information, which is particularly useful for morphologically rich languages like Greek. Additionally, fastText embeddings have been shown to perform well in various downstream NLP tasks (Joulin et al. 2017). Nevertheless, we recognize that other language-agnostic multilingual sentence embeddings, such as LaSER (Artetxe and Schwenk 2019) or LaBERT (Feng et al. 2022), could potentially impact the results, and future research could investigate the performance of these alternative embeddings in cross-linguistic stylometric tasks.

Since these embeddings capture the semantic and syntactic properties of words within their respective languages, the next step was to learn a transformation that aligns the monolingual embedding spaces, effectively mapping them into a shared cross-lingual space. In our case, the alignment was achieved through a linear transformation (Orthogonal Procrustes), which preserves the monolingual properties while aligning translation-equivalent words, finding an orthogonal transformation (a matrix) that most closely maps one set of vectors to another. This process required a bilingual dictionary which was used to supervise the mapping and was provided by the MUSE library (Lample et al. 2018).

For each text, we narrowed the vocabulary down to the 1,000 most frequent words (MFWs), since these carry the most robust stylometric information (Zhao and Zobel 2005; Argamon et al. 2007; García and Martín 2007). We then averaged the word embeddings of these 1,000 MFWs, so that each text is represented by a single 300-dimensional vector: the mean of the embeddings of the corpus’s 1,000 MFWs.
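
The alignment and averaging steps can be sketched as follows (an illustrative reconstruction; loading the actual fastText vectors and the MUSE bilingual dictionary is omitted, and all variable names are hypothetical):

```python
from collections import Counter

import numpy as np


def procrustes_align(X, Y):
    """Solve the Orthogonal Procrustes problem: find the orthogonal
    matrix W minimizing ||XW - Y||_F, where row i of X is the source
    (Greek) embedding of a dictionary entry and row i of Y is the
    embedding of its English translation equivalent."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # orthogonal by construction


def doc_vector(tokens, embeddings, W=None, top_k=1000):
    """Average the embeddings of a text's top_k most frequent words,
    optionally mapping them into the shared cross-lingual space first,
    yielding one fixed-size document vector."""
    mfw = [w for w, _ in Counter(tokens).most_common(top_k) if w in embeddings]
    vecs = np.stack([embeddings[w] for w in mfw])
    if W is not None:
        vecs = vecs @ W  # map into the shared cross-lingual space
    return vecs.mean(axis=0)
```

Because W is orthogonal, the transformation preserves the monolingual geometry (distances and angles) while aligning translation-equivalent words.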

3.4 Machine learning approach

We trained an RF classifier (Breiman 2001) for authorship attribution and gender profiling, choosing it over more recent deep learning algorithms for its stability and robustness. Our focus is on understanding how MT impacts stylometric analysis, without necessarily achieving state-of-the-art classification. RF, an ensemble of Decision Trees, excels at handling large datasets with many variables. It introduces randomness by bootstrapping data samples and selecting random feature subsets for node splitting, balancing variance and bias. This methodology enhances the general accuracy of the model.

All statistical models developed as part of this study were evaluated using 10-fold Cross-Validation (with 90 per cent to 10 per cent train-test split), and the accuracies reported here represent the mean of the accuracies obtained in each fold. Moreover, all other evaluation metrics (precision, recall, F1) were calculated using macro-averaging since we were interested in understanding the performance of the different models equally across each class. Since the feature space in the AMNP datasets was sparse, we eliminated all features that showed a variance close to zero using the two following rules:

  • the percentage of unique values was less than 20 per cent;

  • the ratio of the most frequent to the second most frequent value was greater than 20.

Removing the features with near-zero variance reduced the feature set by 54 per cent for the Modern Greek dataset (from 4,000 to 1,838 features) and by 58.6 per cent for the English dataset (from 4,000 to 1,656 features). Moreover, to normalize the different scales used in the three feature groups, we standardized the data to z-scores (M = 0, SD = 1) before training the RF model.
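
A sketch of this preprocessing and training setup (hypothetical scikit-learn code; the Random Forest hyperparameters are assumptions, and, following the convention of caret’s nearZeroVar filter, a feature is dropped only when both rules hold):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def near_zero_variance_mask(X, unique_cut=0.20, freq_ratio_cut=20.0):
    """Boolean mask of features to keep: a feature is dropped when its
    share of unique values is below unique_cut AND the ratio of its most
    frequent to second most frequent value exceeds freq_ratio_cut."""
    keep = []
    for col in X.T:
        _, counts = np.unique(col, return_counts=True)
        unique_pct = len(counts) / len(col)
        top = np.sort(counts)[::-1]
        ratio = top[0] / top[1] if len(top) > 1 else np.inf
        keep.append(not (unique_pct < unique_cut and ratio > freq_ratio_cut))
    return np.array(keep)


def evaluate(X, y, n_splits=10):
    """z-standardize, train the RF, and return the mean
    10-fold cross-validated accuracy."""
    model = make_pipeline(StandardScaler(),
                          RandomForestClassifier(n_estimators=300,
                                                 random_state=0))
    cv = StratifiedKFold(n_splits, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=cv).mean()
```

Placing the scaler inside the pipeline re-estimates the z-scores on each training fold, which avoids leaking test-fold statistics into training.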

4. Results

The classification experiments are organized in two different sections according to the task at hand—authorship attribution and gender profiling, respectively. In each section, we report the evaluation metrics (accuracy, recall, precision, and F1-score) for each feature group in four distinct experimental setups:

  • Baseline accuracy: the classification’s expected accuracy if performed randomly.

  • Greek dataset: the obtained cross-validated averaged evaluation metrics when the classification was performed on the Greek dataset.

  • English dataset: the obtained cross-validated averaged evaluation metrics when the classification was performed on the English dataset.

  • Cross-linguistic classification: the obtained cross-validated averaged evaluation metrics when the RF was trained using the Greek dataset, and the obtained model was used to make predictions on the English one. This is a simulation of the actual cross-linguistic scenario in which we create models of authors and their gender in one language and then use these models to detect their identity and gender after their texts have been translated into another language—here, English.
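
The cross-linguistic setup differs from the monolingual ones in that no cross-validation is involved: the model is fitted once on the Greek data and evaluated on the English data. A minimal sketch (hypothetical scikit-learn code; variable names and hyperparameters are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def cross_lingual_eval(X_el, y_el, X_en, y_en):
    """Train on the Greek feature matrix, predict on the English one,
    and report accuracy plus macro-averaged precision/recall/F1."""
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    model.fit(X_el, y_el)
    y_pred = model.predict(X_en)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_en, y_pred, average='macro', zero_division=0)
    return {'accuracy': accuracy_score(y_en, y_pred),
            'precision': prec, 'recall': rec, 'f1': f1}
```

Note that this setup presupposes a shared feature space between the two datasets, which is what the language-neutral CLWE and QL feature groups provide.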

4.1 Authorship attribution experiments

The results obtained in the authorship attribution task are reported in Tables 2–4 below (bold values represent the highest metrics obtained in the specific task):

Table 2.

Experiment 1a: Authorship attribution using the AMNP feature group.

Experimental setups                           Accuracy   Recall   Precision   F1
Cross-linguistic (EL: training ⇒ EN: test)      0.01      0.01      0.01     0.01
English                                         0.32      0.31      0.27     0.27
Greek                                           0.35      0.34      0.32     0.30
Baseline                                        0.01

Bold values represent the highest metrics obtained in the specific task.

Table 3.

Experiment 1b: Authorship attribution using the QL feature group.

Experimental setups                           Accuracy   Recall   Precision   F1
Cross-linguistic (EL: training ⇒ EN: test)      0.04      0.04      0.05     0.03
English                                         0.11      0.11      0.09     0.09
Greek                                           0.12      0.12      0.11     0.10
Baseline                                        0.01
Table 4.

Experiment 1c: Authorship attribution using the CLWE feature group.

Experimental setups                           Accuracy   Recall   Precision   F1
Cross-linguistic (EL: training ⇒ EN: test)      0.05      0.05      0.20     0.05
English                                         0.27      0.27      0.23     0.23
Greek                                           0.30      0.30      0.26     0.25
Baseline                                        0.01

Bold values represent the highest metrics obtained in the specific task.

The results reported in the authorship attribution task can be summarized in the following points:

  • The highest-scoring feature group in both the monolingual Greek and English data is the AMNP, followed closely by the CLWE. This is an interesting finding, since AMNP contains roughly five to six times more features (1,838 in the Greek and 1,656 in the English dataset) than CLWE (300 features in either dataset). This confirms previous research findings (Hoenen 2017; Barlas and Stamatatos 2020; Kumar, Gowtham, and Chakraborty 2022), which report excellent authorship attribution results using various forms of word embeddings.

  • In the cross-linguistic authorship attribution experiment, the most effective feature group was CLWE, which achieved results five times more accurate than the baseline, scoring 0.05 compared to 0.01. The second most successful feature set was the Quantitative Linguistics (QL), which outperformed the baseline by four times, achieving a score of 0.04. Both CLWE and QL demonstrate strong potential for effective cross-linguistic authorship attribution. Notably, QL’s accuracy is particularly impressive given its small size of only thirteen features, in contrast to CLWE’s 300 features. QL’s high performance aligns with previous findings by Juola, Mikros, and Vinsick (2019a, 2019b), who highlighted significant correlations between Greek, Spanish, and English QL indices among Greek texts. While the AMNP feature group shows good performance in monolingual datasets, its performance in the cross-linguistic setting only matched the baseline. This outcome was expected, as the n-grams in the source language differ from those in the target.

To better understand the dynamic interaction of the variables involved in the authorship attribution experiments reported above, we statistically evaluated the accuracy results further. Before conducting the analysis, we examined the assumptions of the two-way analysis of variance (ANOVA). Levene’s test for homogeneity of variances yielded a non-statistically significant result (F=1.013, df1=5, df2=174, P=.411), indicating that the assumption of equal variances across groups was met. Additionally, the Shapiro–Wilk normality test was performed to assess the normality of the data. The results showed that the accuracy scores for each feature group (AMNP: Statistic=0.977, df=60, P=.315; CLWE: Statistic=0.987, df=60, P=.749; QL: Statistic=0.968, df=60, P=.114) were not significantly different from a normal distribution, satisfying the assumption for normality.

Having confirmed that the assumptions were met, we proceeded with the two-way ANOVA, using the classification accuracy of each fold in the conducted experiments as the dependent variable, and the feature group and the language employed as the independent variables. We fitted the model for each feature group over 30 folds to create enough data points for the statistical analysis to achieve sufficient statistical power (Bausell and Li 2002). The analysis is restricted to the two monolingual experiments since, in the cross-linguistic task, we did not employ cross-validation: the trained Greek model made a single prediction pass over the English dataset. The interaction plot of the variables involved is shown below (Fig. 1):

Figure 1. Interaction plot of the language and the feature group used in the classification accuracy of the authorship attribution.

The two-way ANOVA revealed statistically significant main effects but no significant interaction between the independent variables. The main effects analysis showed that both the Feature Group (F(2, 174)=857, P<.001) and the Language (F(1, 174)=42, P<.001) had a statistically significant effect on the classification accuracy. However, the effect sizes (partial η2) of the two main effects differ considerably: the Feature Group exhibits a large effect size (0.90), while the Language shows a considerably smaller one (0.19).
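The balanced two-way design just described (feature group × language, with per-fold accuracies as observations) can also be computed by hand. The following pure-Python sketch assumes a balanced design with equal cell sizes and uses illustrative toy numbers, not the study's accuracies:

```python
from itertools import product

def two_way_anova(data):
    """Balanced two-way ANOVA from a dict {(a_level, b_level): [values]}.

    Factor A = feature group, factor B = language, values = per-fold
    accuracies. Sketch for a balanced design (equal observations per cell).
    """
    a_levels = sorted({a for a, _ in data})
    b_levels = sorted({b for _, b in data})
    n = len(next(iter(data.values())))          # observations per cell
    all_vals = [v for cell in data.values() for v in cell]
    grand = sum(all_vals) / len(all_vals)
    cell_mean = {k: sum(v) / n for k, v in data.items()}
    a_mean = {a: sum(cell_mean[(a, b)] for b in b_levels) / len(b_levels)
              for a in a_levels}
    b_mean = {b: sum(cell_mean[(a, b)] for a in a_levels) / len(a_levels)
              for b in b_levels}

    # Sums of squares for the two main effects, the interaction, and error
    ss_a = n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels)
    ss_b = n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels)
    ss_ab = n * sum((cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
                    for a, b in product(a_levels, b_levels))
    ss_e = sum((v - cell_mean[k]) ** 2 for k, cell in data.items() for v in cell)

    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    df_ab = df_a * df_b
    df_e = len(a_levels) * len(b_levels) * (n - 1)
    ms_e = ss_e / df_e
    return {
        "F_feature": (ss_a / df_a) / ms_e,
        "F_language": (ss_b / df_b) / ms_e,
        "F_interaction": (ss_ab / df_ab) / ms_e,
        "eta2_feature": ss_a / (ss_a + ss_e),   # partial eta squared
    }

# Toy, illustrative per-fold accuracies (not the study's data)
folds = {
    ("AMNP", "EL"): [0.30, 0.32], ("AMNP", "EN"): [0.27, 0.29],
    ("QL", "EL"):   [0.10, 0.12], ("QL", "EN"):   [0.08, 0.10],
}
result = two_way_anova(folds)
```

The P-values reported in the text would then be obtained from the F distribution with the corresponding degrees of freedom.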

To gain clearer insight into the specific differences among the features, we employed Tukey’s post-hoc test for multiple comparisons across the three feature groups. This method is particularly suitable as it effectively handles pairwise comparisons within datasets that contain an equal number of subjects in each group, as noted by Stoll (2017). All pairwise differences between the feature groups were statistically significant, indicating that each feature group differed markedly from the others in accuracy. Overall, we can conclude that the feature groups employed in this study perform differently from each other but that each performs similarly across the two languages involved.
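Tukey's HSD procedure compares each pair of group means against a single threshold derived from the pooled error variance. A minimal sketch, assuming equal group sizes and a caller-supplied studentized-range critical value (the accuracy figures below are illustrative, not the study's):

```python
import math
from itertools import combinations

def tukey_hsd(groups, q_crit):
    """Pairwise Tukey HSD comparisons for equal-sized groups.

    `groups` maps group name -> list of scores (e.g. per-fold accuracies);
    `q_crit` is the studentized-range critical value for (k groups, error df),
    looked up from a table by the caller, not computed here.
    """
    n = len(next(iter(groups.values())))
    means = {g: sum(v) / n for g, v in groups.items()}
    # Pooled within-group (error) mean square
    ss_e = sum((x - means[g]) ** 2 for g, v in groups.items() for x in v)
    df_e = sum(len(v) - 1 for v in groups.values())
    hsd = q_crit * math.sqrt((ss_e / df_e) / n)
    # For each pair: (absolute mean difference, significant at the HSD level?)
    return {(a, b): (abs(means[a] - means[b]), abs(means[a] - means[b]) > hsd)
            for a, b in combinations(sorted(groups), 2)}

pairs = tukey_hsd(
    {"AMNP": [0.30, 0.32, 0.31], "CLWE": [0.28, 0.29, 0.30], "QL": [0.10, 0.11, 0.12]},
    q_crit=4.34,  # assumed table value; look up for your k and error df
)
```

A pair is declared significant when its mean difference exceeds the single HSD threshold, which is what lets the test group near-identical performers together.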

Thus, authorship attribution accuracy is higher in the source language, indicating that MT degrades some stylometric information during translation. However, judging by the relatively small effect size mentioned above, this degradation is not severe, and with appropriate experimentation with different classification algorithms and feature groups, MT can still be considered a viable option for cross-linguistic authorship attribution.

4.2 Gender profiling experiments

We performed the gender profiling classifications by following the same methodology described in Section 4.1. We used the same four experimental setups as in the authorship attribution task, and the results are reported in Tables 5–7, respectively (bold values represent the highest metrics obtained in the specific task):

Table 5. Experiment 2a: Gender profiling using the AMNP feature group.

Experimental setups                          Accuracy  Recall  Precision  F1
Cross-linguistic (EL: training ⇒ EN: test)   0.50      0.57    0.50       0.54
English                                      0.67      0.68    0.66       0.67
Greek                                        0.69      0.70    0.69       0.69
Baseline                                     0.50
Table 6. Experiment 2b: Gender profiling using the QL feature group.

Experimental setups                          Accuracy  Recall  Precision  F1
Cross-linguistic (EL: training ⇒ EN: test)   0.53      0.53    0.56       0.46
English                                      0.60      0.63    0.60       0.62
Greek                                        0.56      0.58    0.57       0.57
Baseline                                     0.50
Table 7. Experiment 2c: Gender profiling using the CLWE feature group.

Experimental setups                          Accuracy  Recall  Precision  F1
Cross-linguistic (EL: training ⇒ EN: test)   0.54      0.47    0.54       0.50
English                                      0.69      0.71    0.68       0.69
Greek                                        0.69      0.72    0.68       0.70
Baseline                                     0.50

Bold values represent the highest metrics obtained in the specific task.

From the results above, we can conclude the following:

  • Word embeddings are the best-performing feature group for gender profiling, followed by AMNP in the monolingual corpora. This confirms the superiority of word embeddings as a feature group in a wide range of computational stylistics tasks, as already mentioned above.

  • In the cross-linguistic experiment, CLWE and QL performed nearly equally (0.54 and 0.53, respectively), indicating that both feature groups can convey cross-linguistic stylometric information. AMNP, on the other hand, scored on par with the baseline, as expected, due to the difference in the n-grams between the source and target languages.
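The accuracy, recall, precision, and F1 values reported in Tables 5–7 follow the standard definitions. As an illustration of how they relate for a two-class task like gender profiling (the study's exact averaging scheme across the two classes is not restated here), a minimal per-class computation looks like this:

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, recall, precision, and F1, treating one class as positive.

    Illustrative re-implementation of the standard definitions; the
    averaging scheme used in the study's tables may differ.
    """
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, precision, f1

# Toy gold labels vs. predictions for a two-class (F/M) task
acc, rec, prec, f1 = binary_metrics(
    ["F", "F", "M", "M", "F"], ["F", "M", "M", "M", "F"], positive="F"
)
```

With balanced classes, as in gender profiling with a 0.50 baseline, accuracy and macro-averaged F1 track each other closely, which is why the tables can report both without contradiction.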

To analyze the above observations more systematically, we performed a two-way ANOVA using the gender classification accuracy as the dependent variable and the language and the feature group used in the gender profiling experiments as the independent variables. As before, the analysis involves only the gender classification experiments conducted separately on each monolingual dataset. We followed the same procedure as in the authorship attribution analysis and tested the homogeneity of variances and the normality assumption. Levene’s test yielded a non-statistically significant result (F=0.360, df1=5, df2=174, P=.875), indicating that the assumption of equal variances across groups was met. Additionally, the Shapiro–Wilk normality test was performed to assess the normality of the data. The results showed that the accuracy scores for each feature group (AMNP: Statistic=0.975, df=60, P=.266; CLWE: Statistic=0.986, df=60, P=.730; QL: Statistic=0.982, df=60, P=.526) were not significantly different from a normal distribution. The interaction plot is shown in Fig. 2:

Figure 2. Interaction plot of the language and the feature group used in the classification accuracy of the gender profiling.

The two-way ANOVA revealed that the main effect of the feature group was statistically significant (F(2, 174)=122.7, P<.001) with a considerable effect size (partial η2=0.58). We further analyzed this main effect with Tukey’s post-hoc test, which grouped AMNP and CLWE together, separating them from QL, since these two feature groups achieve nearly identical classification accuracies.

Furthermore, the analysis showed a statistically significant interaction between the Feature Group and Language (F(2, 174)=3.2, P<.05). When using the AMNP feature group, gender profiling is more accurate for the Greek dataset, a trend that shifts when employing the QL features, which appear to perform better with the English dataset.

Contrary to the results observed in the authorship attribution experiments, in gender profiling, MT does not appear to significantly reduce classification accuracy. This suggests that the stylometric information pertinent to gender is more resilient during language switching.

5. Conclusions

This study investigated whether using MT to translate original documents into another language can be a reliable strategy for cross-linguistic authorship attribution and gender profiling tasks. We used a large Greek blog corpus translated automatically into English and ran experiments using three different types of features: AMNP, CLWE, and QL.

In the monolingual authorship attribution experiments, AMNP was effective in both languages, confirming its strength in current stylometric research. CLWE also proved highly effective in both the source and target languages. Authorship attribution tends to be more accurate in the source language, suggesting that MT slightly impairs stylometric characteristics. However, the modest effect size suggests this impairment is manageable, and using translated texts for cross-linguistic authorship attribution can still be considered feasible with appropriate methodological adjustments.

In cross-linguistic authorship attribution with training data in the source language and testing data in the target language, CLWE achieved accuracy quintuple that of the baseline (0.05 versus 0.01), followed closely by QL with fourfold accuracy (0.04 versus 0.01). QL’s effectiveness is remarkable given its minimal feature count (13) compared to CLWE (300). QL indices appear to capture a stable trait of idiolect above the specific language used, with lexical “richness” patterns transferring across linguistic codes and likely associated with higher-level intellectual functions. However, their impact on authorship attribution is limited, and more research is needed to identify stylometric features that associate more efficiently with authors’ identities across languages.

Gender profiling benefits from distinct feature groups, primarily CLWE’s rich semantic representation, which uncovers diverse writing styles of male and female authors, including differences in word choice, thematic emphasis, and underlying syntactic patterns. CLWE can identify tendencies such as more emotive and descriptive language in female authors versus more assertive or direct language in male authors, and can distinguish genders by subject matter and narrative style.

The study found that using MT to convert a multilingual corpus into one language for authorship attribution and gender profiling is feasible, with comparable accuracies. Gender profiling accuracy was equal across languages, while authorship attribution was slightly better in the source language, with only a minor effect of the language condition.

In the scenario of cross-linguistic authorship attribution and gender profiling with trained models and testing data in different languages, CLWE and QL feature groups offered better-than-chance probability in predicting the author’s identity and gender. Although accuracies were lower than in the monolingual experiments, they provide a foundation for more comprehensive research in this area.

However, it is important to note that both Greek and English are Indo-European languages with relatively similar structures and properties. The results obtained in this study may not be generalizable to languages from other language families with significantly different characteristics. Furthermore, when considering languages with vastly different structures, such as polysynthetic languages where a single word may correspond to multiple words in English, the effectiveness of our approach may be limited. In such cases, the translation process could significantly alter the word counts and other linguistic features, potentially affecting the performance of stylometric analyses. The applicability of word embeddings in these scenarios remains an open question that calls for further research efforts.

Moreover, it is important to acknowledge that the tools we chose for obtaining the translations and the multilingual sentence embeddings may have influenced the results of our study. While both Google’s NMT and fastText were selected for their strong performance and wide applicability, it is possible that using alternative MT tools, such as e-Translation or DeepL, or different language-agnostic multilingual sentence embeddings, like LASER or LaBSE, could lead to different results.

Future research should explore the impact of these choices on cross-linguistic authorship attribution and gender profiling tasks to better understand the robustness and generalizability of our findings. Comparing the performance of various MT tools and multilingual embeddings could also provide helpful insights into the optimal setup for cross-linguistic stylometric analyses. Moreover, a wider variety of text genres and an enriched set of language pairs and machine learning algorithms would strengthen the generalizability of our conclusions. In the future, we plan to investigate further the properties of lexical diversity features in cross-linguistic stylometry and to experiment with novel approaches that capture more dynamic aspects of text production.

Acknowledgement

Open Access funding provided by the Qatar National Library.

Authors’ contributions

George Mikros (Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing—original draft, and Writing—review & editing) and Dimitris Boumparis (Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Writing—original draft, and Writing—review & editing)

References

Alroobaea
R.
et al. (
2020
) ‘A Deep Learning Model to Predict Gender, Age and Occupation of the Celebrities based on Tweets Followers’, in: Cappellato, L., Eickhoff, C., Ferro, N., and Névéol, A. (eds) Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Vol. 2696. Thessaloniki, Greece: CEUR Workshop Proceedings. https://pan.webis.de/downloads/publications/papers/alroobaea_2020.pdf.

Aravantinou
C.
et al. (
2015
) ‘Gender Classification of Web Authors Using Feature Selection and Language Models’, in
Ronzhin
A.
,
Potapova
R.
,
Fakotakis
N.
(eds.)
Speech and Computer. SPECOM 2015
, pp.
226
233
.
Berlin, Germany
:
Springer International Publishing
. https://doi.org/10.1007/978-3-319-23132-7_28.

Argamon
S.
,
Levitan
S.
(
2005
) ‘Measuring the usefulness of function words for authorship attribution’, Proceedings of the 2005 ACH/ALLC Conference. Victoria, BC, Canada: ACH/ALLC

Argamon
S.
et al. (
2007
) ‘
Stylistic Text Classification Using Functional Lexical Features
’,
Journal of American Society for Information Science and Technology
,
58
:
802
22
. https://doi.org/10.1002/asi.v58:6.

Artetxe
M.
,
Schwenk
H.
(
2019
) ‘
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond’,
Transactions of the Association for Computational Linguistics
,
7
:
597
610
. https://doi.org/10.1162/tacl_a_00288.

Badirli
S.
et al. (
2021
) ‘Open Set Authorship Attribution Toward Demystifying Victorian Periodicals’, in:
Lladós
J.
,
Lopresti
D.
,
Uchida
S.
(eds)
Document Analysis and Recognition—ICDAR 2021
, pp.
221
35
.
Berlin, Germany
:
Springer International Publishing
. https://doi.org/10.1007/978-3-030-86337-1_15.

Bamman
D.
,
Eisenstein
J.
,
Schnoebelen
T.
(
2014
) ‘
Gender Identity and Lexical Variation in Social Media’,
Journal of Sociolinguistics
,
18
:
135
60
. https://doi.org/10.1111/josl.12080.

Barlas
G.
,
Stamatatos
E.
(
2020
) ‘Cross-Domain Authorship Attribution Using Pre-trained Language Models’, in:
Maglogiannis
I.
,
Iliadis
L.
,
Pimenidis
E.
(eds)
Artificial Intelligence Applications and Innovations
, Vol.
583
, pp.
255
66
.
Berlin, Germany
:
Springer International Publishing
. https://doi.org/10.1007/978-3-030-49161-1_22.

Baroni
M.
,
Bernardini
S.
(
2005
) ‘
A New Approach to the Study of Translationese: Machine-learning the Difference between Original and Translated Text’,
Literary and Linguistic Computing
,
21
:
259
74
. https://doi.org/10.1093/llc/fqi039.

Bates
E.
,
Goodman
J. C.
(
1997
) ‘
On the Inseparability of Grammar and the Lexicon: Evidence from the Acquisition, Aphasia and Real-time Processing’,
Language and Cognitive Processes
,
12
:
507
84
. https://doi.org/10.1080/016909697386628.

Bausell
R. B.
,
Li
Y. F.
(
2002
)
Power Analysis for Experimental Research: A Practical Guide for the Biological, Medical and Social Sciences
.
Cambridge
:
Cambridge University Press
.

Bayot
R.
,
Gonçalves
T.
(
2016
) ‘Multilingual author profiling using word embedding averages and SVMs’, 2016 10th International Conference on Software, Knowledge, Information Management & Applications (SKIMA), pp.
382
6
. Piscataway, NJ: IEEE. https://doi.org/10.1109/SKIMA.2016.7916251.

Bernardini
S.
,
Baroni
M.
(
2005
)
Spotting Translationese. A Corpus-Driven Approach Using Support Vector Machines
. In Proceedings of Corpus Linguistics Conference Series 2005,
Birmingham, UK: Birmingham University
.

Bogdanova
D.
,
Lazaridou
A.
(
2014
) ‘Cross-language authorship attribution’, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pp.
2015
2020
. Luxemburg: European Language Resources Association. http://www.lrec-conf.org/proceedings/lrec2014/pdf/145_Paper.pdf.

Bojanowski
P.
et al. (
2017
) ‘
Enriching Word Vectors with Subword Information’,
Transactions of the Association for Computational Linguistics
,
5
:
135
46
. https://doi.org/10.1162/tacl_a_00051.

Boumparis
D.
(
n.d
.)
Identifying Crosswriters’ Altering Style in Books for Children and Adults Using Supervised Machine Learning
.
Antwerp, Belgium
:
University of Antwerp
[paper]. https://github.com/dimboump/crosswriters.

Breiman
L.
(
2001
) ‘
Random Forests’,
Machine Learning
,
45
:
5
32
. https://doi.org/10.1023/A:1010933404324.

Daller
H.
,
van Hout
R.
,
Treffers‐Daller
J.
(
2003
) ‘
Lexical Richness in the Spontaneous Speech of Bilinguals’,
Applied Linguistics
,
24
:
197
222
. https://doi.org/10.1093/applin/24.2.197.

David
E.
et al. (
2016
) ‘
Utilizing Facebook Pages of the Political Parties to Automatically Predict the Political Orientation of Facebook Users’,
Online Information Review
,
40
:
610
23
. https://doi.org/10.1108/OIR-09-2015-0308.

Dias
R. F. S.
,
Paraboni
I.
(
2018
) ‘Author profiling using word embeddings with subword information: notebook for PAN at CLEF 2018’, in: Cappellato, L., Ferro, N., Nie, J.-Y., and Soulier, L. (eds) Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum. Avignon, France: CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2125/paper_97.pdf.

Ellis
C.
,
Holt
Y. F.
,
West
T.
(
2015
) ‘
Lexical Diversity in Parkinson’s Disease’,
Journal of Clinical Movement Disorders
,
2
:
1
6
. https://doi.org/10.1186/s40734-015-0017-4.

Fabien
M.
et al. (
2020
) ‘BertAA: BERT fine-tuning for authorship attribution’, in: Bhattacharyya, P., Sharma, D. M., and Sangal, R. (eds.) Proceedings of the 17th International Conference on Natural Language Processing, pp.
127
137
. Indian Institute of Technology Patna, Patna, India: NLP Association of India (NLPAI). https://aclanthology.org/2020.icon-main.16.pdf.

Feng
F.
et al. (
2022
) ‘Language-agnostic BERT sentence embedding’, in: Muresan, S., Nakov, P., and Villavicencio, A. (eds) Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.
878
91
. Cedarville, OH: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.62.

Fergadiotis
G.
,
Wright
H. H.
,
West
T. M.
(
2013
) ‘
Measuring Lexical Diversity in Narrative Discourse of People with Aphasia’,
American Journal of Speech-Language Pathology
,
22
:
397
408
. https://doi.org/10.1044/1058-0360(2013/12-0083).

Franco-Salvador
M.
et al. (
2017
) ‘Subword-based deep averaging networks for author profiling in social media’, in: Capellato, L., Ferro, N., Goeuriot, L., and Mandl, T. (eds) Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Vol. 1866. Dublin, Ireland: CEUR Workshop Proceedings. https://ceur-ws.org/Vol-1866/paper_192.pdf.

Frawley
W.
(
1985
) ‘
Translation. Literary, Linguistic, and Philosophical Perspectives’,
Babel
,
31
:
106
7
. https://doi.org/10.1075/babel.31.2.19tra.

García
A. M.
,
Martín
J. C.
(
2007
) ‘
Function Words in Authorship Attribution Studies’,
Literary and Linguistic Computing
,
22
:
49
66
. https://doi.org/10.1093/llc/fql048.

Gellerstam
M.
(
1986
) ‘Translationese in Swedish novels translated from English’, in: Wollin, L. and Lindquist, H. (eds) Translation Studies in Scandinavia: Proceedings from the Scandinavian Symposium on Translation Theory (SSOTT) II, pp.
88
95
. Lund: CWK Gleerup.

Grieve
J.
(
2007
) ‘
Quantitative Authorship Attribution: An Evaluation of Techniques’,
Literary and Linguistic Computing
,
22
:
251
70
. https://doi.org/10.1093/llc/fqm020.

Hirsch
J. E.
(
2005
) ‘
An Index to Quantify an Individual’s Scientific Research Output’,
Proceedings of the National Academy of Sciences of the United States of America
,
102
:
16569
72
. https://doi.org/10.1073/pnas.0507655102.

Hoenen
A.
(
2017
) ‘Using word embeddings for computing distances between texts and for authorship attribution’, in: Frasincar, F., et al. (eds) Natural Language Processing and Information Systems. Proceedings of the 22nd International Conference on Applications of Natural Language to Information Systems, NLDB 2017, pp.
274
77
. Berlin, Germany: Springer International Publishing. https://doi.org/10.1007/978-3-319-59569-6_33.

Jin
M.
,
Murakami
M.
(
2007
) ‘
Authorship Identification Ising Random Forests’,
Proceedings of the Institute of Statistical Mathematics
,
55
:
255
68
.

Joulin
A.
et al. (
2017
) ‘Bag of Tricks for Efficient Text Classification’, in: Lapata, M., Blunsom, P., and Koller, A. (eds) Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 2, Short Papers, pp.
427
31
. Cedarville, OH: Association for Computational Linguistics. https://aclanthology.org/E17-2068.

Juola
P.
,
Mikros
G. K.
,
Vinsick
S.
(
2019a
) ‘
A Comparative Assessment of the Difficulty of Authorship Attribution in Greek and in English’,
Journal of the Association for Information Science and Technology
,
70
:
61
70
. https://doi.org/10.1002/asi.24073.

Juola
P.
,
Mikros
G. K.
,
Vinsick
S.
(
2019b
) ‘
Correlations and Potential Cross-Linguistic Indicators of Writing Style’,
Journal of Quantitative Linguistics
,
26
:
146
71
. https://doi.org/10.1080/09296174.2018.1458395.

Kestemont
M.
(
2014
) ‘Function words in authorship attribution. From black magic to theory’, Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL), pp.
59
66
. Cedarville, OH: Association for Computational Linguistics. https://doi.org/10.3115/v1/W14-0908.

Koppel
M.
,
Argamon
S. E.
,
Shimoni
A. R.
(
2002
) ‘
Automatically Categorizing Written Texts by Author Gender’,
Literary and Linguistic Computing
,
17
:
401
12
. https://doi.org/10.1093/llc/17.4.401.

Kubát
M.
,
Matlach
V.
,
Čech
R.
(
2014
)
QUITA: Quantitative Index Text Analyzer
.
Lüdenscheid, Germany
:
RAM-Verlag
.

Kumar
T.
,
Gowtham
S.
,
Chakraborty
U. K.
(
2022
) ‘Comparing Word Embeddings on Authorship Identification’, in:
Borah
S.
,
Panigrahi
R.
(eds)
Applied Soft Computing: Tecniques and Applications
, pp.
177
94
.
Williston, VT
:
Apple Academic Press
. https://doi.org/10.1201/9781003186885.

Lample
G.
et al. (
2018
) ‘Unsupervised Machine Translation Using Monolingual Corpora Only’, In ICLR 2018 Conference Track. 6th International Conference on Learning Representations, April 30 - May 3, 2018. Vancouver, Canada: ICLR.

Lee
C. S.
(
2019
) ‘
Stylometric Comparative Analysis of Style in Human vs. Machine Literary Translations’,
The Journal of Translation Studies
,
20
:
111
30
. https://doi.org/10.15749/jts.2019.20.2.005.

Loh
A.
,
Soo
K.
,
Xing
H.
(
2016
) ‘Predicting Sexual Orientation based on Facebook Status’, Projects in the Course CS229: Machine Learning. Retrieved 12 October 2023 from http://cs229.stanford.edu/proj2016/report/.

Long
M. H.
,
Richards
J. C.
(
2007
) ‘Series Editors’ Preface’, in:
Daller
H.
,
Milton
J.
,
Treffers-Daller
J.
(eds)
Modelling and Assessing Vocabulary Knowledge
, pp.
1
33
.
Cambridge
:
Cambridge University Press
.

López-Santillán
R.
et al. (
2020
) ‘
Richer Document Embeddings for Author Profiling Tasks Based on a Heuristic Search’,
Information Processing & Management
,
57
:
102227
. https://doi.org/10.1016/j.ipm.2020.102227.

Lundeqvist
E.
,
Svensson
M.
(
2017
) ‘Author Profiling: A Machine Learning Approach Towards Detecting Gender, Age and Native Language of Users in Social Media’, Master thesis, Uppsala University, Uppsala, Denmark. https://urn.kb.se/resolve?urn=urn%3Anbn%3Ase%3Auu%3Adiva-330464.

Luyckx
K.
,
Daelemans
W.
(
2008
) ‘Authorship attribution and verification with many authors and limited data’, Proceedings of the 22nd International Conference on Computational Linguistics, (COLING '08), Vol. 1, pp.
513
520
. Cedarville, OH: Association for Computational Linguistics. https://dl.acm.org/doi/pdf/10.5555/1599081.1599146.

McCollister
C.
(
2016
) ‘Predicting Author Traits Through Topic Modeling of Multilingual Social Media Text’, MSc thesis, University of Kansas, Lawrence, KS. https://kuscholarworks.ku.edu/bitstream/handle/1808/22342/McCollister_ku_0099M_14757_DATA_1.pdf.

McIntosh
R. P.
(
1967
) ‘
An Index of Diversity and the Relation of Certain Concepts to Diversity’,
Ecology
,
48
:
392
404
. https://doi.org/10.2307/1932674.

Mikros
G. K.
(
2020
) ‘Finding the Author of a Translation. An Experiment in Authorship Attribution Using Machine Learning Methods in Original Texts and Translations of the Same Author’, in:
Kelih
E.
,
Köhler
R.
(eds)
Words and Numbers. In Memory of Peter Grzybek (1957–2019)
, pp.
71
82
.
Lüdenscheid, Germany
:
RAM-Verlag
.

Mikros
G. K.
(
2018
) ‘Blended Authorship Attribution: Unmasking Elena Ferrante Combining Different Author Profiling Methods’, in: Tuzzi, A., and Cortelazzo, M. (eds) Drawing Elena Ferrante’s Profile: Workshop Proceedings, pp.
85
95
. Padova PD, Italy: Padova University Press.

Mikros
G. K.
(
2013a
) ‘Authorship Attribution and Gender Identification in Greek Blogs’, in:
Obradović
I.
,
Kelih
E.
,
Köhler
R.
(eds)
Methods and Applications of Quantitative Linguistics in Belgrade, Serbia, April 16-19, 2012
, pp.
21
32
.
New Delhi, India
:
Academic Mind
.

Mikros
G. K.
(
2013b
) ‘Systematic Stylometric Differences in Men and Women Authors: A Corpus-based Study’, in:
Köhler
R.
,
Altmann
G.
(eds)
Issues in Quantitative Linguistics 3. Dedicated to Karl-Heinz Best on the Occasion of his 70th Birthday
, pp.
206
23
.
Lüdenscheid, Germany
:
RAM-Verlag
.

Mikros
G. K.
,
Perifanos
K.
(
2013
) ‘Authorship Attribution in Greek Tweets Using Multilevel Author’s n-gram profiles’, in: Hovy, E., et al. (eds) Papers from the 2013 AAAI Spring Symposium «Analyzing Microtext», pp.
17
23
. Washington, DC: AAAI Press. http://www.aaai.org/ocs/index.php/SSS/SSS13/paper/view/5714/5914.

Mikros
G., K.
,
Perifanos
K.
(
2015
) ‘Gender Identification in Modern Greek Tweets’, in:
Tuzzi
A.
,
Benešová
M.
,
Macutek
J.
(eds)
Recent Contributions to Quantitative Linguistics
, Vol.
70
, pp.
75
88
.
Berlin, Germany
:
De Gruyter
. https://doi.org/10.1515/9783110420296-008.

Neal
T.
et al. (
2017
) ‘
Surveying Stylometry Techniques and Applications’,
ACM Computing Surveys
,
50
:
1–36
. https://doi.org/10.1145/3132039.

Nöth
W.
(
1995
)
Handbook of Semiotics
.
Bloomington, IN
:
Indiana University Press
.

Nowson
S.
(
2006
)
The Language of Weblogs: A Study of Genre and Individual Differences
.
Edinburgh, Scotland
:
University of Edinburgh
.

Oakes
M. P.
(
1998
)
Statistics for Corpus Linguistics
.
Edinburgh, Scotland
:
Edinburgh University Press
.

Owen
A. J.
,
Leonard
L. B.
(
2002
) ‘
Lexical Diversity in the Spontaneous Speech of Children with Specific Language Impairment: Application of D’,
Journal of Speech Language and Hearing Research
,
45
:
927
37
. https://doi.org/10.1044/1092-4388(2002/075).

Popescu
I. I.
(
2007
) ‘The Ranking by the Weight of Highly Frequent Words’, in:
Grzybek
P.
,
Köhler
R.
(eds)
Exact Methods in the Study of Language and Text
, pp.
555
565
.
Berlin, Germany
:
De Gruyter
.

Popescu
I. I.
et al. (
2009
)
Word Frequency Studies
.
Berlin, Germany
:
Mouton de Gruyter
.

Popescu
I. I.
,
Altmann
G.
(
2007
) ‘
Writer’s View of Text Generation’,
Glottometrics
,
15
:
71
81
.

Popescu
I. I.
,
Best
K. H.
,
Altmann
G.
(
2007
) ‘
On the Dynamics of Word Classes in Text’,
Glottometrics
,
14
:
58
71
.

Popescu
I. I.
et al. (
2010
)
Vectors and Codes of Text
.
Lüdenscheid, Germany
:
RAM-Verlag
.

Posadas-Durán
J. P.
et al. (
2017
) ‘
Application of the Distributed Document Representation in the Authorship Attribution Task for Small Corpora’,
Soft Computing
,
21
:
627
39
. https://doi.org/10.1007/s00500-016-2446-x.

Potamianos
G.
,
Jelinek
F.
(
1998
) ‘
A Study of N-gram and Decision Tree Letter Language Modeling Methods’,
Speech Communication
,
24
:
171
92
. https://doi.org/10.1016/S0167-6393(98)00018-1.

Rangel
F.
,
Rosso
P.
(
2013
) ‘Use of Language and Author Profiling: Identification of Gender and Age’, In B. Sharp & M. Zock (eds), Proceedings of the 10th Workshop on Natural Language Processing and Cognitive Science (NLPCS-2013), Marseille, France, 15–16 October 2013. Marseille, France: ACL. pp. 177–186.

Rao
S.
,
Raju
G.
,
Kumar
V.
(
2017
) ‘
Authorship Attribution on Imbalanced English Editorial Corpora’,
International Journal of Computer Applications
,
169
:
44
7
. https://www.ijcaonline.org/archives/volume169/number1/rao-2017-ijca-914587.pdf.

Read
J.
(
2000
)
Assessing Vocabulary
.
Cambridge
:
Cambridge University Press
.

Ruder
S.
,
Ghaffari
P.
,
Breslin
J. G.
(
2016
) ‘Character-level and Multi-channel Convolutional Neural Networks for Large-scale Authorship Attribution’, Computing Research Repository (CoRR). Retrieved 10 October 2023 from http://arxiv.org/abs/1609.06686.

Saldanha
G.
(
2011
) ‘
Translator Style’,
The Translator
,
17
:
25
50
. https://doi.org/10.1080/13556509.2011.10799478.

Sari
Y.
(
2018
)
Neural and Non-neural Approaches to Authorship Attribution
.
England
:
University of Sheffield
. https://etheses.whiterose.ac.uk/21415/1/FinalThesis_Yunita.pdf.

Schaetti
N.
(
2017
) ‘UniNE at CLEF 2017: TF-IDF and Deep-learning for Author profiling’, in: Cappellato, L., et al. (eds) Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Vol. 1866. CEUR-WS.org. https://ceur-ws.org/Vol-1866/paper_80.pdf.

Shrestha P. et al. (2017) ‘Convolutional Neural Networks for Authorship Attribution of Short Texts’, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 2, Short Papers, pp. 669–74. Valencia, Spain: ACL. https://aclanthology.org/E17-2106.

Szymor N. (2015) ‘Behavioural Profiling in Translation Studies’, Trans-Kom Zeitschrift Für Translationswissenschaft Und Fachkommunikation, 8: 483–98. http://www.trans-kom.eu/bd08nr02/trans-kom_08_02_08_Szymor_Profiling.20151211.pdf.

Stamatatos E. et al. (2018) ‘Overview of PAN 2018’, in: Bellot P., et al. (eds) Experimental IR Meets Multilinguality, Multimodality, and Interaction, pp. 267–85. Cham, Switzerland: Springer International Publishing.

Stoll A. (2017) ‘Post Hoc Tests: Tukey Honestly Significant Difference Test’, in: Allen M. (ed.) The SAGE Encyclopedia of Communication Research Methods, pp. 1306–7. Thousand Oaks, CA: SAGE Publications, Inc. https://doi.org/10.4135/9781483381411.n452.

Tanaka R., Jin M. (2014) ‘Authorship Attribution of Cell-phone E-mail’, International Journal on Information (Japan), 17: 1217–26.

Treffers-Daller J. (2011) ‘Operationalizing and Measuring Language Dominance’, International Journal of Bilingualism, 15: 147–63. https://doi.org/10.1177/1367006910381186.

Treffers-Daller J., Korybski T. (2015) ‘Using Lexical Diversity Measures to Operationalise Language Dominance in Bilinguals’, in: Silva-Corvalan C., Treffers-Daller J. (eds) Language Dominance in Bilinguals: Issues of Measurement and Operationalization, pp. 106–23. Cambridge: Cambridge University Press. http://centaur.reading.ac.uk/39019/.

Tumasjan A. et al. (2010) ‘Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment’, Proceedings of the International AAAI Conference on Web and Social Media, 4: 178–85. https://ojs.aaai.org/index.php/ICWSM/article/view/14009.

Tweedie F. J., Baayen H. R. (1998) ‘How Variable May a Constant Be? Measures of Lexical Richness in Perspective’, Computers and the Humanities, 32: 323–52.

van Halteren H. et al. (2005) ‘New Machine Learning Methods Demonstrate the Existence of a Human Stylome’, Journal of Quantitative Linguistics, 12: 65–77. https://doi.org/10.1080/09296170500055350.

van Velzen M. H., Nanetti L., de Deyn P. P. (2014) ‘Data Modelling in Corpus Linguistics: How Low May We Go?’, Cortex, 55: 192–201. https://doi.org/10.1016/j.cortex.2013.10.010.

Veenhoven R. et al. (2018) ‘Using Translated Data to Improve Deep Learning Author Profiling Models: Notebook for PAN at CLEF 2018’, in: Cappellato L., et al. (eds) Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum. Avignon, France: CEUR-Workshop Proceedings. http://ceur-ws.org/Vol-2125/paper_178.pdf.

Wu H., Zhang Z., Wu Q. (2021) ‘Exploring Syntactic and Semantic Features for Authorship Attribution’, Applied Soft Computing, 111: 107815. https://doi.org/10.1016/j.asoc.2021.107815.

Wu Y. et al. (2016) ‘Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation’, Computing Research Repository (CoRR). http://arxiv.org/abs/1609.08144.

Xu P., Jelinek F. (2004) ‘Random Forests in Language Modeling’, in: Lin D., Wu D. (eds) Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 325–32. Barcelona, Spain: Association for Computational Linguistics. https://aclanthology.org/W04-3242.

Yule G. U. (1944) The Statistical Study of Literary Vocabulary. Cambridge: Cambridge University Press.

Zhang X., Zhao J., LeCun Y. (2015) ‘Character-level Convolutional Networks for Text Classification’, Proceedings of the 28th International Conference on Neural Information Processing Systems, Vol. 1, pp. 649–57. Cambridge, MA: MIT Press.

Zhao Y., Zobel J. (2005) ‘Effective and Scalable Authorship Attribution Using Function Words’, in: Lee G., et al. (eds) Information Retrieval Technology, Vol. 3689, pp. 174–89. Berlin, Germany: Springer. https://doi.org/10.1007/11562382_14.

Zheng W., Jin M. (2022) ‘A Review on Authorship Attribution in Text Mining’, WIREs Computational Statistics, 15: e1584. https://doi.org/10.1002/wics.1584.

Author notes

Work done during Master’s at University of Antwerp.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.