Abstract
The proliferation of user-generated content on online platforms such as Reddit has created vast datasets from which significant knowledge can be extracted, including the authorship of individual texts. Author attribution is the task of identifying the writer of a given text, in this case a comment or post. Such analysis can help us understand how people interact on Reddit, and it could also be useful for detecting fake or secondary accounts or for characterizing how individual users behave. In particular, author verification and author attribution help determine whether two or more accounts belong to the same person or to genuinely different people. The challenge is that inferring authorship from the short, casual posts common on Reddit is not straightforward. Brevity usually leads to sparse feature representations, and informal language can obscure the consistent stylistic patterns that are readily apparent in longer, more formal writing. Existing author attribution methods may not be well suited to capturing the subtle yet distinctive linguistic cues that differentiate authors within the communication style of Reddit and similar platforms. This research addresses author attribution in Reddit comments by combining a comprehensive feature engineering strategy with an evaluation of several machine learning and deep learning models. Our hypothesis is that, by examining a range of linguistic features, especially stylometric features typical of Reddit posts, we can determine whether two comments were written by the same person. Our literature survey suggested that stylometric analysis is particularly relevant to author attribution and verification.
To assess the effectiveness of these engineered features, we evaluate three distinct classification models: a decision tree, a random forest, and a bidirectional long short-term memory (LSTM) network. For the bidirectional LSTM, we compare performance with two different inputs. The outcomes of this research have implications for several groups. Researchers in natural language processing (NLP) and computational social science can use our feature engineering pipeline and comparative model analysis as a benchmark and as a starting point for future work on author attribution in online contexts. Furthermore, understanding the linguistic markers that characterize individual authors on platforms like Reddit can inform sociolinguistic studies and research on the evolution of online communication. Ultimately, this work aims to advance the state of the art in author attribution for short, informal text, providing practical tools and insights for a range of applications.
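As a rough illustration of the kind of stylometric comparison described above, the following minimal sketch turns a pair of comments into a feature vector that a decision tree or random forest could consume. It is not the feature set used in this work: the five features and the function names `stylometric_features` and `pair_features` are assumptions chosen only for illustration.

```python
import re
import string


def stylometric_features(text):
    """Return a small, fixed-order list of style features for one comment.

    The features here are illustrative assumptions, not the paper's pipeline.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    n_chars = max(len(text), 1)      # avoid division by zero on empty input
    n_tokens = max(len(tokens), 1)
    return [
        sum(len(t) for t in tokens) / n_tokens,                # mean word length
        len({t.lower() for t in tokens}) / n_tokens,           # type-token ratio
        sum(c in string.punctuation for c in text) / n_chars,  # punctuation rate
        sum(c.isupper() for c in text) / n_chars,              # uppercase rate
        sum(c.isdigit() for c in text) / n_chars,              # digit rate
    ]


def pair_features(comment_a, comment_b):
    """Absolute per-feature differences between two comments.

    Small differences suggest similar style; a classifier trained on labeled
    same-author and different-author pairs would learn the decision boundary.
    """
    fa = stylometric_features(comment_a)
    fb = stylometric_features(comment_b)
    return [abs(x - y) for x, y in zip(fa, fb)]
```

Under these assumptions, each labeled comment pair yields one row of differences, so the same table can be passed to any tabular classifier. A sequence model such as a bidirectional LSTM would instead need token-level or character-level input.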