Author attribution for Reddit data: a comparative study on machine learning models : a thesis in Data Science

Sairama Amulya Baswa

doi:10.62791/20519

Thesis

Open access

Master of Science (MS), University of Massachusetts Dartmouth

2025

Abstract

The proliferation of user-generated content on online platforms like Reddit has created vast datasets from which significant knowledge can be extracted, including the authorship of texts. Author attribution is the task of identifying the writer of a given text, in this case a Reddit comment or post. This analysis can help us understand how people interact on Reddit, and it could also be useful for detecting fake accounts and secondary accounts, or for getting a better sense of how individual users behave. In particular, author verification and author attribution help us determine whether two or more accounts belong to the same person or to genuinely different people. However, inferring authorship from the short, casual posts typical of Reddit is not straightforward. Brevity usually leads to sparse feature representations, and informal language can obscure the consistent stylistic patterns that are readily apparent in longer, more formal writing. Existing author attribution methods may not be well suited to capturing the subtle yet distinctive linguistic cues that differentiate authors within the communication style of Reddit and similar platforms. This research addresses the challenge of author attribution in Reddit comments by combining a comprehensive feature engineering strategy with the evaluation of several machine learning and deep learning models. Our hypothesis is that, by examining a range of linguistic features, especially stylometric features typical of Reddit posts, we can determine whether two comments were written by the same person. Our literature survey suggested that stylometric analysis is particularly relevant to author attribution and verification.
To assess the effectiveness of these engineered features, we evaluate the performance of three distinct classification models: a decision tree, a random forest, and a bidirectional long short-term memory (LSTM) network. For the bidirectional LSTM, we compare performance with two different inputs. The outcomes of this research have implications for several groups of stakeholders. Researchers in natural language processing (NLP) and computational social science can use our feature engineering pipeline and comparative model analysis as a benchmark and a starting point for future work on author attribution in online contexts. Furthermore, understanding the linguistic markers that characterize individual authors on platforms like Reddit can inform sociolinguistic studies and research on the evolution of online communication. Ultimately, this work aims to advance the state of the art in author attribution for short, informal text data, providing practical tools and valuable insights for a range of applications.
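
As a rough illustration of the stylometric idea described in the abstract, the following minimal Python sketch computes a few surface-level style features for a short comment and then builds a pairwise difference vector that a classifier (such as a decision tree or random forest) could use to judge whether two comments share an author. The function names and the particular features chosen here are assumptions made for illustration only; the abstract does not list the thesis's actual feature set.

```python
import re
import string


def stylometric_features(comment: str) -> dict[str, float]:
    """Compute a few simple style features for one short comment.

    The feature choice is hypothetical, not the thesis's feature set.
    """
    words = re.findall(r"[A-Za-z']+", comment)
    n_chars = max(len(comment), 1)  # avoid division by zero on empty text
    n_words = max(len(words), 1)
    return {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        "punct_ratio": sum(c in string.punctuation for c in comment) / n_chars,
        "upper_ratio": sum(c.isupper() for c in comment) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in comment) / n_chars,
    }


def pair_features(a: str, b: str) -> list[float]:
    """Absolute per-feature difference between two comments.

    Small differences across all features suggest similar style; a
    verification model would be trained on vectors like this one.
    """
    fa, fb = stylometric_features(a), stylometric_features(b)
    return [abs(fa[k] - fb[k]) for k in sorted(fa)]
```

Calling `pair_features(comment_a, comment_b)` yields one fixed-length vector per comment pair, which keeps the representation dense even when individual comments are very short.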

Files and links (1)

Baswa S.A. COE MS Thesis 2025 (pdf, 1.64 MB)

Open Access CC BY-NC-ND V4.0

Metrics

1671 File views/ downloads

38 Record Views

Details

Title: Author attribution for Reddit data
Creators: Sairama Amulya Baswa
ORCID: 0009-0007-2468-4039
Contributors: Gokhan Kul (Advisor) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Yuchou Chang (Committee Member) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Ashokkumar Ratilal Patel (Committee Member) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Number of pages: xii, 94 pages
Illustrations: illustrations (chiefly color)
Table of contents: List of Figures -- List of Tables -- Abbreviations -- Chapter 1. Introduction -- Chapter 2. Literature review -- Authorship attribution in social media: state-of-the-art and notable contributions -- What is unknown and the gap in the literature -- Chapter 3. Methodology -- Fill the gap -- Overall methodology -- Exploratory data analysis -- POS analysis, stylometric analysis, and TF-IDF analysis -- Preprocessing of the data -- Baseline methods: decision tree and random forest with TF-IDF of distorted POS sequences -- Proposed method -- Evaluation -- Chapter 4. Dataset -- Data acquisition: sourcing Reddit data -- Data preprocessing: preparing the dataset for analysis -- Chapter 5. Results -- Temporal dynamics and co-activity patterns of authors -- Performance comparison -- Chapter 6. Discussions -- Chapter 7. Conclusion -- Bibliography.
References: Includes bibliographical references (pages 90-94).
Awarding Institution: University of Massachusetts Dartmouth
Degree Awarded: Master of Science (MS)
Degree in: Data Science
Academic Unit: Department of Computer and Information Science
Language: English
Resource Type: Thesis
DOI: https://doi.org/10.62791/20519
Record Identifier: 9914504464101301