Abstract
Multimodal Visual Question Answering (VQA) in science requires reasoning over domain-specific text and diagrams. Large language models (LLMs) excel at text reasoning, but they often see images only through captions, which lose spatial and relational detail. The ScienceQA benchmark (21k questions with lectures and diagrams) highlights this gap. Chain-of-Thought (CoT) prompting improves reasoning but suffers from error propagation and long context dependencies. Atom-of-Thought (AoT) addresses these issues by decomposing a problem into self-contained "atoms" organized in a Directed Acyclic Graph (DAG), each conditioned only on the question and visual input. This modular structure minimizes dependencies between reasoning steps, reduces error accumulation, and enables parallel reasoning. This thesis is the first to explore AoT in multimodal VQA. Its contributions are: (1) adapting AoT for visual-text integration, (2) instruction-tuning LLaVA-7B on AoT rationales generated via GPT, and (3) evaluating accuracy and reasoning clarity on ScienceQA. In zero-shot evaluation, the fine-tuned LLaVA-7B achieves 51.97% accuracy (vs. 51.50% for the base model, a gain of 0.47 percentage points), while GPT-4o-mini with AoT reaches 83.09%, surpassing the GPT-3 CoT baseline of 75.17% by 7.92 points.
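As a rough illustration of how a DAG of atoms allows parallel reasoning, the sketch below groups atoms into batches whose members have no unresolved dependencies. It is not the thesis implementation: the `Atom` class, the `parallel_levels` helper, the example prompts, and the dependency edges are all hypothetical, chosen only to show the idea. Only Python's standard `graphlib` module is used.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Atom:
    """One self-contained sub-question (hypothetical structure)."""
    name: str
    prompt: str
    depends_on: tuple = ()  # names of atoms whose answers this atom needs


def parallel_levels(atoms):
    """Group atoms into batches that could be answered in parallel.

    Atoms in the same batch have no unresolved dependencies on each other.
    """
    graph = {atom.name: set(atom.depends_on) for atom in atoms}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        levels.append(ready)
        sorter.done(*ready)
    return levels


# Hypothetical decomposition of a diagram-based science question.
atoms = [
    Atom("read_diagram", "Which objects does the diagram show?"),
    Atom("recall_fact", "Which rule from the lecture applies?"),
    Atom("answer", "Combine the findings and choose an option.",
         depends_on=("read_diagram", "recall_fact")),
]
levels = parallel_levels(atoms)
print(levels)  # [['read_diagram', 'recall_fact'], ['answer']]
```

In this toy example the first two atoms depend only on the question and the image, so they can be sent to the model at the same time; an error in one of them does not feed into the other, unlike in a single sequential chain.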