Improving computational efficiency via tiny machine learning data-centric methods: a thesis in Data Science

Michael Soricelli

doi:10.62791/20484

Thesis

Open access

Master of Science (MS), University of Massachusetts Dartmouth

2025

Abstract

As the capabilities of machine learning models grow, so do their costs in power consumption and training time. In the era of deep learning, models have become steadily larger. While this added complexity leads to greater performance, it also requires more computation to train. According to recent estimates, state-of-the-art transformer-based models require 3-5 months to train while consuming 50 gigawatt-hours of energy. While not all deep learning models are this large, higher-performing models do follow a trend of increasing model size over time. As a consequence, training high-performing models on resource-constrained devices becomes more challenging. To address this problem, one can try to minimize training costs through a data-centric approach: reducing the size of the dataset so that less data is needed for training, while maintaining performance. Two data-centric approaches are used primarily in this work: dataset pruning and dataset condensation. Dataset pruning removes less efficient training samples from a dataset, while dataset condensation condenses the knowledge from a large dataset into a smaller, synthetic dataset while preserving the performance of models trained on it. In this work, we propose two methods that draw on both of these ideas to minimize the cost of training a deep neural network. We use vector quantization and normalizing flow models to perform variations of distribution learning, which guides the training process toward more data-efficient samples and improves upon existing state-of-the-art dataset pruning methods.
Our results show significantly reduced training costs across multiple benchmark image classification datasets, while maintaining classification performance and outperforming other state-of-the-art dataset pruning methods.
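
As a rough illustration of the general idea behind dataset pruning described in the abstract, the sketch below scores each training sample and keeps only a fraction of them. It is not the method proposed in the thesis: the scoring rule (predictive entropy of a model's class probabilities), the choice to keep the most uncertain samples, and the names `predictive_entropy`, `prune_by_uncertainty`, and `keep_fraction` are all assumptions made for this example.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prune_by_uncertainty(class_probs, keep_fraction):
    """Return the sorted indices of the samples to keep.

    Assumed rule (hypothetical, for illustration only): keep the
    `keep_fraction` of samples whose predictions have the highest
    entropy and drop the rest.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(keep_fraction * len(class_probs)))
    scores = [predictive_entropy(p) for p in class_probs]
    # Rank sample indices from most to least uncertain.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy data: class probabilities for five samples over three classes.
toy_probs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],
    [0.70, 0.20, 0.10],
]
kept = prune_by_uncertainty(toy_probs, keep_fraction=0.4)
print(kept)
```

In practice, the class probabilities would come from a model trained briefly on the full dataset, and the retained indices would define the smaller training set. Other scoring rules could be substituted without changing the surrounding structure.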

Files and links (1)

pdf

Soricelli M. COE MS Thesis 2025 (1.43 MB)

CC BY-NC-ND V4.0, Open Access

Metrics

163 File views/downloads

17 Record Views

Details

Title: Improving computational efficiency via tiny machine learning data-centric methods
Creators: Michael Soricelli
ORCID: 0009-0008-4570-9618
Contributors: Yuchou Chang (Advisor) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Christopher Hixenbaugh (Committee Member) - Naval Undersea Warfare Center
Long Jiao (Committee Member) - University of Massachusetts Dartmouth, Department of Computer and Information Science
Number of pages: x, 40 pages
Illustrations: illustrations (some color)
Table of contents: List of figures -- List of tables -- Chapter 1. Introduction -- Background -- Computational challenges -- Data-centric approach -- Thesis contribution -- Chapter 2. Related works -- Dataset pruning -- Dataset condensation -- Vector quantization -- Normalizing flows -- Chapter 3. Methods -- Methodology A: Data efficient training via feature distillation -- Methodology B: Uncertainty based pruning with density estimation -- Chapter 4. Experimental results -- Methodology A results -- Methodology B results -- Chapter 5. Discussion and conclusion -- Data efficient training via feature distillation -- Uncertainty based pruning with density estimation -- Future directions -- Conclusion -- References.
References: Includes bibliographical references (pages 38-40).
Awarding Institution: University of Massachusetts Dartmouth
Degree Awarded: Master of Science (MS)
Degree in: Data Science
Academic Unit: Department of Computer and Information Science
Language: English
Resource Type: Thesis
DOI: https://doi.org/10.62791/20484
Record Identifier: 9914504161201301