Enhancing static code analysis in reverse engineering using retrieval-augmented generation and large language models: a thesis in Computer Engineering

Brendan C. Thibault

doi:10.62791/20450

Thesis

Open access

Master of Science (MS), University of Massachusetts Dartmouth

2025

Abstract

This thesis explores the application of large language models (LLMs) in a reverse engineering context to perform accurate static analysis of code. A novel approach uses retrieval-augmented generation (RAG) to enhance the model output with contextual metadata, improving the accuracy and relevance of the generated documentation. This methodology was implemented within an artificial intelligence (AI)-enabled reverse engineering platform that allows users to perform static analysis and includes features such as a linear disassembly view, graph-based navigation, and AI-driven code summaries. The results show that a RAG-based approach outperforms previous methods of LLM-assisted reverse engineering. This research demonstrates significant performance and accuracy improvements using an AI-enabled framework and highlights the potential of integrating LLMs into reverse engineering workflows. The results have practical implications for reverse engineering, showing that a RAG-enhanced LLM approach can significantly assist the reverse engineering process. Future work includes identifying the optimal context to retrieve during the RAG retrieval step using an unsupervised machine learning approach, as well as incorporating more powerful reasoning models.

Files and links (1)

Thibault B.C. COE MS Thesis 2025 (pdf, 4.17 MB)

CC BY-NC-ND V4.0, Open Access

Details

Title: Enhancing static code analysis in reverse engineering using retrieval-augmented generation and large language models
Creators: Brendan C. Thibault
ORCID: 0009-0000-7407-5687
Contributors: Lance Fiondella (Advisor) - University of Massachusetts Dartmouth, Department of Electrical and Computer Engineering
Hong Liu (Committee Member) - University of Massachusetts Dartmouth, Department of Electrical and Computer Engineering
Ruolin Zhou (Committee Member) - University of Massachusetts Dartmouth, Department of Electrical and Computer Engineering
Gokhan Kul (Committee Member) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Number of pages: viii, 50 pages
Illustrations: illustrations (chiefly color)
Table of contents: List of figures -- List of tables -- Chapter 1. Introduction -- Chapter 2. Background -- Introduction to binary reverse engineering -- Traditional static analysis techniques and tools -- Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) -- Related work in AI-assisted reverse engineering and the research gap -- Chapter 3. Methods -- Retrieval-augmented generation (RAG) methodology -- Prompt engineering for binary analysis -- Evaluation metrics -- Chapter 4. AI enabled binary analysis framework -- System architecture -- Backend implementation -- User interface -- Results -- Conclusion.
References: Includes bibliographical references (pages 48-50).
Awarding Institution: University of Massachusetts Dartmouth
Degree Awarded: Master of Science (MS)
Degree in: Computer Engineering
Academic Unit: Department of Electrical and Computer Engineering
Language: English
Resource Type: Thesis
DOI: https://doi.org/10.62791/20450
Record Identifier: 9914444128201301