A study of explainable contrastive language-image pre-training models using perturbation-based analysis: a thesis in Computer Science

Ojonukpemi Felix Ameh

doi:10.62791/20414

Thesis

Open access

Master of Science (MS), University of Massachusetts Dartmouth

2025

Abstract

Contrastive Language-Image Pre-training (CLIP) models are machine learning models that learn by contrasting correct and incorrect pairings of text and images, and that gain broad foundations from training on large datasets. They occupy an emerging area of Artificial Intelligence research because they offer a way to map multiple data modalities together and exhibit impressive zero-shot predictions. An important part of advancing the viability of CLIP-like models is thorough analysis that characterizes as much of their inner workings as possible and increases their explainability. Pursuing explainability, particularly in multimodal models where decision processes are quite opaque, is an important part of their continued improvement. It provides easily intelligible interpretations of model outputs and decision-making processes, and it aids widespread adoption. CLIP, the widely used foundational contrastive model, serves as a benchmark for multimodal representation learning, which makes it an ideal choice for explainability-focused analysis. This thesis introduces an approach for conducting perturbation-based analysis that investigates how systematic modifications to input images affect their similarity scores and semantic alignment, and that provides structured, comprehensible visual and statistical outputs. One experiment identifies inputs that are highly susceptible to specific perturbations, while another visualizes semantic drift across a controlled set of text labels, illustrating how perturbations change the model's conceptual understanding of the input.
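
The general shape of the perturbation-based analysis described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: `toy_image_encoder` is a hypothetical stand-in for CLIP's image encoder (a fixed linear projection), the text embeddings are random vectors with made-up labels, and an additive brightness shift is assumed as the perturbation. Only the pattern is meant to carry over: perturb the image, re-embed it, and track how its cosine similarity to each text label changes.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def shift_brightness(image, delta):
    """Add `delta` to every pixel of a grayscale image, clipping to [0, 1]."""
    return [[min(1.0, max(0.0, p + delta)) for p in row] for row in image]


def toy_image_encoder(image, projection):
    """Hypothetical stand-in for a CLIP image encoder: project flattened pixels."""
    flat = [p for row in image for p in row]
    return [sum(w * p for w, p in zip(weights, flat)) for weights in projection]


def similarity_drift(image, text_embeddings, projection, deltas):
    """Map each brightness shift to the image's similarity with every text label."""
    results = {}
    for delta in deltas:
        embedding = toy_image_encoder(shift_brightness(image, delta), projection)
        results[delta] = {
            label: cosine_similarity(embedding, text_vec)
            for label, text_vec in text_embeddings.items()
        }
    return results


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy data is reproducible
    image = [[rng.random() for _ in range(4)] for _ in range(4)]
    projection = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(8)]
    # Illustrative labels only; a real study would use CLIP text embeddings.
    text_embeddings = {
        "a photo of a cat": [rng.uniform(-1, 1) for _ in range(8)],
        "a photo of a dog": [rng.uniform(-1, 1) for _ in range(8)],
    }
    drift = similarity_drift(image, text_embeddings, projection, [0.0, 0.2, 0.4])
    for delta, scores in drift.items():
        print(delta, scores)
```

In a real study the toy encoder and random text vectors would be replaced by the embeddings of an actual CLIP model, and the shift of 0.0 serves as the unperturbed baseline against which drift is measured.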

Files and links (1)

Ameh O.F. COE MS Thesis 2025 (PDF, 3.31 MB)

CC BY-NC-ND V4.0, Open Access

Details

Title: A study of explainable contrastive language-image pre-training models using perturbation-based analysis
Creators: Ojonukpemi Felix Ameh
ORCID: 0000-0002-1416-2719
Contributors: Yuchou Chang (Advisor) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Firas Khatib (Committee Member) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Adnan El-Nasan (Committee Member) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Number of pages: x, 78 pages
Illustrations: illustrations (chiefly color)
Table of contents: List of figures -- List of tables -- Abbreviations -- Chapter 1. Introduction -- Explainability in AI -- Perturbation-based analysis -- Chapter 2. Related work -- Perturbation-based analysis for XAI -- Studying the robustness of CLIP -- Non-perturbation-based works on explainable CLIP or CLIP-like models -- Chapter 3. Proposed method -- Overview of the CLIP perturbation method breakdown -- Structure and implementation of the CLIP perturbation method -- Dataset, data flow, and data preprocessing -- Perturbation-based analysis -- Implementation of perturbation-based analysis -- Chapter 4. Results -- Overview of experiments -- Experiment 1 breakdown and discussion of results -- Experiment 2 breakdown and discussion of results -- Investigating brightness -- Supporting statistics -- Summary and future considerations -- Conclusion -- References.
References: Includes bibliographical references (pages 71-73).
Awarding Institution: University of Massachusetts Dartmouth
Degree Awarded: Master of Science (MS)
Degree in: Computer Science
Academic Unit: Department of Computer and Information Science
Language: English
Resource Type: Thesis
DOI: https://doi.org/10.62791/20414
Record Identifier: 9914443626501301