Abstract
The boom in generative artificial intelligence models has brought great change to many facets of society. At the forefront of these technologies are large language models (LLMs), which produce coherent and comprehensive textual responses to user input. The widespread adoption of LLMs has raised concerns about their susceptibility to jailbreaking, in which malicious or harmful prompts are provided in the hope of eliciting a response that violates the model’s ethical guidelines. Techniques for handling such prompts vary from model to model, but none is perfect, and the threat of jailbreaking remains ever-present. This paper presents a technique for evaluating a model’s susceptibility to jailbreaking prompts, using metrics designed for evaluating the moral beliefs a model encodes in its responses to ethical dilemmas. A survey of jailbreak prompts was presented to an LLM and its responses were analyzed using these metrics. Results indicated that the models surveyed were more susceptible than expected, complying with harmful requests with surprising frequency. The causes and implications of this behavior are discussed in detail, along with points for further investigation.