Applying statistical measures of belief to an analysis of the susceptibility of large language models to jailbreaking: a thesis in Computer Science

Eric Timothy Faith

doi:10.62791/20504

Thesis

Open access

Master of Science (MS), University of Massachusetts Dartmouth

2025

Abstract

The boom in generative artificial intelligence models has brought great change to many facets of society. At the forefront of these technologies are large language models (LLMs), which produce coherent and comprehensive textual responses to user input. The widespread adoption of LLMs has raised concerns about their susceptibility to jailbreaking, in which malicious or harmful prompts are provided in the hope of eliciting a response that violates the model’s ethical guidelines. Techniques for handling such prompts vary from model to model, but none is perfect, and the threat of jailbreaking remains ever-present. This paper presents a technique for evaluating a model’s susceptibility to jailbreaking prompts, using metrics designed for evaluating encoded moral beliefs in response to ethical dilemmas. A survey of jailbreak prompts was presented to an LLM and its responses were analyzed using these metrics. Results indicated that the models surveyed were more susceptible than expected, choosing to comply with harmful requests with surprising frequency. The causes and implications of this behavior are discussed in detail, as well as points for further investigation.

Files and links (1)

Faith E.T. COE MS Thesis 2025 (pdf, 1.81 MB)

CC BY-NC-ND V4.0, Open Access

Details

Title: Applying statistical measures of belief to an analysis of the susceptibility of large language models to jailbreaking
Creators: Eric Timothy Faith
ORCID: 0009-0004-2501-1862
Contributors: Long Jiao (Advisor) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Adnan El-Nasan (Committee Member) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Jiawei Yuan (Committee Member) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Number of pages: viii, 45 pages
Illustrations: illustrations (chiefly color)
Table of contents: List of Figures -- List of Tables -- Chapter 1. Introduction -- Large language models -- Jailbreaking an LLM -- Summary of proposed solution -- Chapter 2. Related work -- Theory-of-mind in large language models -- Defining statistical measures for evaluating encoded beliefs -- Chapter 3. System architecture -- Transformers -- Attention mechanism -- Transformers and jailbreaking -- Chapter 4. Proposed solution -- Token-to-action mapping -- Dataset generation -- Question forms -- Model selection -- Survey administration -- Likelihood and entropy-based belief metrics -- Diagnostic metrics -- Chapter 5. Experimental and numerical results -- Chapter 6. Discussion -- Limitations -- Future research directions -- References.
References: Includes bibliographical references (pages 44-45).
Awarding Institution: University of Massachusetts Dartmouth
Degree Awarded: Master of Science (MS)
Degree in: Computer Science
Academic Unit: Department of Computer and Information Science
Language: English
Resource Type: Thesis
DOI: https://doi.org/10.62791/20504
Record Identifier: 9914504161001301