Abstract
Multimodal large language models (MLLMs), which can reason across different modalities, are opening new frontiers in AI by enabling complex reasoning over both visual and textual inputs. One application of MLLMs is Visual Question Answering (VQA) in healthcare: a clinician can ask questions about medical images in natural language and receive detailed explanations. As Voltaire's aphorism might be adapted for the modern age: "The art of medicine consists in amusing the patient while nature cures the disease - but in the age of AI, assisting the doctor while learning from the data." This research examines the efficacy of fine-tuning multimodal (vision and language) foundation models for medical visual question answering. The study evaluates how effectively these models interpret and respond to medical queries grounded in visual inputs, with the ultimate goal of enhancing diagnostic accuracy and patient care. By leveraging the strengths of both vision and language processing, the research seeks to advance the capabilities of AI in medical settings. The author studies three fine-tuned variants derived from the baseline LLaVA-Med architecture: a caption-only baseline trained to generate global descriptions of medical images; an instruction-tuned model that better answers diagnostic and region-specific questions; and a variant that combines Alpha-CLIP with LLaVA-Med to enable spatially targeted understanding by conditioning responses on user-defined regions of interest (ROIs). All variants are adapted with Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning (PEFT) technique that keeps computational overhead minimal.
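Conceptually, LoRA freezes the pretrained weights and trains only a small low-rank update added to them. The sketch below illustrates the mechanism in NumPy; the dimensions, rank, and scaling factor are illustrative assumptions, not the configuration used in this work.

```python
import numpy as np

# Minimal sketch of Low-Rank Adaptation (LoRA): a frozen weight matrix W is
# augmented with a trainable low-rank product B @ A, so only r * (d_in + d_out)
# parameters are trained instead of d_in * d_out. All sizes here are
# illustrative, not those of LLaVA-Med.
rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 64, 64, 8, 16   # rank r is much smaller than d_in, d_out

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

def lora_forward(x):
    """Frozen path plus the scaled low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapted layer initially matches the frozen layer.
assert np.allclose(lora_forward(x), W @ x)
```

Because B starts at zero, training begins from the pretrained model's behavior and only the small A and B matrices (here 1,024 parameters instead of 4,096) receive gradient updates.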
The results show that the baseline LLaVA-Med model, trained only on image-caption pairs, achieves modest BLEU (0.09) and ROUGE-2 (0.14) scores but struggles to answer specific questions, whereas the instruction-tuned variant attains the best ROUGE-1 (0.59) and ROUGE-L (0.54). The region-focused combination of Alpha-CLIP with LLaVA-Med achieves the best ROUGE-2 (0.40) and BLEU (0.28) while exhibiting accurate, context-sensitive reasoning, underscoring the importance of task-specific adaptation in medical VQA. Together, these results suggest that focusing attention on specific image regions, combined with instruction-based reasoning, is essential for medical AI systems that handle diverse types of data. This approach enables a systematic comparison of methods and offers a concrete strategy for building specialized diagnostic medical AI assistants.
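The ROUGE-N scores reported above are n-gram overlap F1 measures between a generated answer and a reference answer. The following is a simplified sketch of that computation (not the official ROUGE scorer and not the evaluation code used in this study); the example sentences are hypothetical.

```python
from collections import Counter

def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N: F1 over n-gram overlap between two strings."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical model answer vs. reference caption:
score = rouge_n_f1("mild cardiomegaly is present",
                   "the image shows mild cardiomegaly")  # unigram F1 ≈ 0.444
```

ROUGE-1 counts unigram overlap, ROUGE-2 bigram overlap (here the stricter match explains why ROUGE-2 scores run lower than ROUGE-1 across all three variants), and BLEU applies an analogous precision-oriented n-gram comparison.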