Multimodal atom-of-thought reasoning in science question answering: a thesis in Data Science
Thesis   Open access

Yashika Patil
Master of Science (MS), University of Massachusetts Dartmouth
2025
DOI: https://doi.org/10.62791/20487

Abstract

Multimodal Visual Question Answering (VQA) in science requires reasoning over domain-specific text and diagrams. While large language models (LLMs) excel at text reasoning, they often rely on image captions, losing spatial and relational details. The ScienceQA benchmark (21k questions with lectures and diagrams) highlights this gap. Chain-of-Thought (CoT) prompting improves reasoning but suffers from error propagation and long context dependencies. Atom-of-Thought (AoT) addresses these issues by decomposing problems into independent, self-contained "atoms" organized in a Directed Acyclic Graph (DAG), each conditioned only on the question and visual input. This modular, DAG-based structure minimizes dependencies, reduces error accumulation, and enables parallel reasoning. This thesis is the first to explore AoT in multimodal VQA. The contributions are: (1) adapting AoT for visual-text integration, (2) instruction-tuning LLaVA-7B with AoT rationales generated via GPT, and (3) evaluating accuracy and reasoning clarity on ScienceQA. Results show that in zero-shot evaluation, fine-tuned LLaVA-7B achieves 51.97% accuracy (vs. 51.50% base), while GPT-4o-mini with AoT reaches 83.09%, surpassing the GPT-3 CoT baseline of 75.17%.
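
The DAG-based decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the `Atom` structure, the example atom names, and the `answer_fn` callback (a stand-in for a vision-language model call) are all hypothetical. The sketch assumes, following the abstract, that each atom is answered from the question and image alone, so graph edges only determine ordering and which atoms could be answered in parallel.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List


@dataclass
class Atom:
    """One self-contained sub-question (hypothetical structure)."""
    name: str
    prompt: str
    depends_on: List[str] = field(default_factory=list)


def parallel_levels(atoms: List[Atom]) -> List[List[str]]:
    """Group atom names into levels; atoms within a level do not depend on each other."""
    sorter = TopologicalSorter({a.name: a.depends_on for a in atoms})
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for reproducible output
        levels.append(ready)
        sorter.done(*ready)
    return levels


def run_atoms(
    question: str,
    image: object,
    atoms: List[Atom],
    answer_fn: Callable[[str, object, str], str],
) -> Dict[str, str]:
    """Answer every atom from the question and image only (assumption from the abstract)."""
    by_name = {a.name: a for a in atoms}
    results: Dict[str, str] = {}
    for level in parallel_levels(atoms):
        # Atoms in one level are independent and could be dispatched concurrently.
        for name in level:
            results[name] = answer_fn(question, image, by_name[name].prompt)
    return results


# Hypothetical atoms for a diagram-based science question.
atoms = [
    Atom("identify_objects", "Which objects appear in the diagram?"),
    Atom("read_labels", "What do the labels in the diagram say?"),
    Atom("compare", "How do the labeled objects differ?",
         depends_on=["identify_objects", "read_labels"]),
    Atom("final", "Which answer option matches the comparison?",
         depends_on=["compare"]),
]

levels = parallel_levels(atoms)
# Stub model call used only to make the sketch runnable.
results = run_atoms("Which magnet pair attracts?", None, atoms,
                    lambda q, img, p: f"answer to: {p}")
```

In this sketch the first level contains two atoms with no dependencies, which is where parallel reasoning would apply; a real system would replace the stub callback with a model call and add a step that aggregates the atom answers into a final choice.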
Patil Y. COE MS Thesis 2025 (PDF, 3.50 MB)
CC BY-NC-ND V4.0 Open Access
