Abstract

Twitter is a widely used micro-blogging site on which users publish their views on recent happenings. The wide reach of these messages across a large audience poses a threat, because the personally identifiable information they disclose may lead to user regret. The Tweet-Scan-Post (TSP) system scans tweets contextually for sensitive messages. The tweet repository was generated using cyber-keywords for personal, professional and health tweets. The Rules of Sensitivity and Contextuality were defined based on standards established by various national regulatory bodies. The naive sensitivity regression function uses a Bag-of-Words model built from short text messages. Class imbalance in the dataset, with 25% sensitive and 75% insensitive tweets, results in misclassification. The system opted for stacked classification to combat this imbalance. Initially, the system applied various state-of-the-art algorithms and predicted 26% of the tweets to be sensitive. The proposed stacked classification approach increased the overall proportion of tweets identified as sensitive to 35%. The system contributes a vocabulary of 201 Sensitive Privacy Keywords, obtained using the boosting approach, for the three tweet categories. Finally, the system formulates a sensitivity scale, called TSP’s Tweet Sensitivity Scale, to detect the degree of disclosed sensitive information. The scale is based on Senti-Cyber features composed of Sensitive Privacy Keywords, Cyber-keywords with Non-Sensitive Privacy Keywords and Non-Cyber-keywords.
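As a rough illustration of the Bag-of-Words representation mentioned above, the following minimal sketch turns short messages into token-count vectors and counts matches against a keyword list. It is not the authors' implementation: the tokenizer, the function names, the example tweets and the three-word keyword set are all assumptions made for illustration, and the paper's 201 Sensitive Privacy Keywords are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical keyword list for illustration only; not the paper's vocabulary.
SENSITIVE_KEYWORDS = {"diagnosed", "salary", "address"}


def tokenize(text):
    """Lowercase a tweet and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(tweets):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({tok for tweet in tweets for tok in tokenize(tweet)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(tweet, vocabulary):
    """Return the token-count vector of one tweet over the vocabulary."""
    counts = Counter(tokenize(tweet))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


def sensitive_hits(tweet):
    """Count tokens that appear in the (hypothetical) sensitive keyword list."""
    return sum(1 for tok in tokenize(tweet) if tok in SENSITIVE_KEYWORDS)


# Made-up example tweets, not taken from the paper's repository.
tweets = [
    "Just got diagnosed with flu, staying home",
    "Great match last night",
]
vocab = build_vocabulary(tweets)
vectors = [bag_of_words(t, vocab) for t in tweets]
```

Vectors of this kind, possibly extended with the keyword-hit count as an extra feature, are what a downstream classifier or a stack of classifiers would consume.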

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)