Multimodal atom-of-thought reasoning in science question answering: a thesis in Data Science

Yashika Patil

doi:10.62791/20487

Thesis

Open access

Master of Science (MS), University of Massachusetts Dartmouth

2025

Abstract

Multimodal Visual Question Answering (VQA) in science requires reasoning over domain-specific text and diagrams. While large language models (LLMs) excel at text reasoning, they often rely on image captions, losing spatial and relational details. The ScienceQA benchmark (21k questions with lectures and diagrams) highlights this gap. Chain-of-Thought (CoT) prompting improves reasoning but suffers from error propagation and long context dependencies. Atom-of-Thought (AoT) addresses these issues by decomposing problems into independent, self-contained "atoms" organized in a Directed Acyclic Graph (DAG), each conditioned only on the question and visual input. This modular, DAG-based structure minimizes dependencies, reduces error accumulation, and enables parallel reasoning. This thesis is the first to explore AoT in multimodal VQA. The contributions are: (1) adapting AoT for visual-text integration, (2) instruction-tuning LLaVA-7B with AoT rationales generated via GPT, and (3) evaluating accuracy and reasoning clarity on ScienceQA. Results show that in zero-shot evaluation, fine-tuned LLaVA-7B achieves 51.97% accuracy (vs. 51.50% base), while GPT-4o-mini with AoT reaches 83.09%, surpassing the GPT-3 CoT baseline of 75.17%.
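
The DAG organization of atoms described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the `Atom` class, the `build_prompt` and `parallel_batches` helpers, and the three example atoms are hypothetical names introduced here for illustration only. The sketch assumes that each atom's prompt is built solely from the original question and the visual input (here a caption string), and that atoms whose prerequisites are complete can be processed in the same batch.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Atom:
    """Hypothetical reasoning unit: one self-contained sub-question."""
    name: str
    sub_question: str
    depends_on: tuple = ()  # names of atoms that must be answered first


def build_prompt(atom, question, image_caption):
    # Assumption: an atom is conditioned only on the original question and
    # the visual input, never on the rationale text of other atoms.
    return (
        f"Question: {question}\n"
        f"Image: {image_caption}\n"
        f"Sub-question: {atom.sub_question}"
    )


def parallel_batches(atoms):
    """Group atom names into batches whose members could run in parallel."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    sorter = TopologicalSorter({a.name: set(a.depends_on) for a in atoms})
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # atoms with no pending prerequisites
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical atoms for an illustrative science question.
ATOMS = [
    Atom("identify_objects", "Which objects appear in the diagram?"),
    Atom("recall_concept", "Which scientific concept does the question test?"),
    Atom(
        "final_answer",
        "Which answer choice is correct?",
        depends_on=("identify_objects", "recall_concept"),
    ),
]

if __name__ == "__main__":
    for batch in parallel_batches(ATOMS):
        print(batch)
```

In this sketch the first two atoms have no prerequisites, so they fall into the same batch and could be answered concurrently; only the final atom waits on both. How the thesis actually combines atom outputs into a final answer is not stated in this record.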

Files and links (1)

pdf

Patil Y. COE MS Thesis 2025 (3.50 MB)

CC BY-NC-ND V4.0, Open Access

Details

Title: Multimodal atom-of-thought reasoning in science question answering
Creators: Yashika Patil
ORCID: 0009-0003-0444-344X
Contributors: Jiawei Yuan (Advisor) - University of Massachusetts Dartmouth, Department of Computer and Information Science; Ming Daniel Shao (Advisor) - University of Massachusetts Lowell; Yuchou Chang (Committee Member) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Number of pages: viii, 48 pages
Illustrations: illustrations (chiefly color)
Table of contents: List of figures -- List of tables -- 1. Introduction -- Background -- Motivation -- Thesis contributions -- Thesis organization -- 2. Related work -- ScienceQA: a benchmark for multimodal reasoning -- Chain-of-thought reasoning -- Atom of thought reasoning -- 3. Methodology -- ScienceQA data format -- Proposed framework -- Image captioning via BLIP-2 -- Data generation with GPT-4o-mini -- Data processing -- Instruction tuning LLaVA-7B -- 4. Results and analysis -- Inference -- Discussion -- 5. Conclusion and future work -- Implications of the study -- Future research -- Appendix supplementary materials -- Image description quality: BLIP-2 vs. ViT -- Model training -- Comparative reasoning examples.
References: Includes bibliographical references (pages 35-38).
Awarding Institution: University of Massachusetts Dartmouth
Degree Awarded: Master of Science (MS)
Degree in: Data Science
Academic Unit: Department of Computer and Information Science
Language: English
Resource Type: Thesis
DOI: https://doi.org/10.62791/20487
Record Identifier: 9914504464001301