Applying statistical measures of belief to an analysis of the susceptibility of large language models to jailbreaking: a thesis in Computer Science
Thesis   Open access

Eric Timothy Faith
Master of Science (MS), University of Massachusetts Dartmouth
2025
DOI: https://doi.org/10.62791/20504

Abstract

The boom in generative artificial intelligence models has brought great change to many facets of society. At the forefront of these technologies are large language models (LLMs), which produce coherent and comprehensive textual responses to user input. The widespread adoption of LLMs has raised concerns about their susceptibility to jailbreaking, in which malicious or harmful prompts are provided in the hope of eliciting a response that violates the model’s ethical guidelines. Techniques for handling such prompts vary from model to model, but none is perfect, and the threat of jailbreaking remains ever-present. This paper presents a technique for evaluating a model’s susceptibility to jailbreaking prompts, using metrics designed to evaluate the moral beliefs encoded in responses to ethical dilemmas. A survey of jailbreak prompts was presented to an LLM and its responses were analyzed using these metrics. Results indicated that the models surveyed were more susceptible than expected, complying with harmful requests surprisingly often. The causes and implications of this behavior are discussed in detail, along with points for further investigation.
PDF: Faith E.T. COE MS Thesis 2025 (1.81 MB)
CC BY-NC-ND V4.0 Open Access
