Improving computational efficiency via tiny machine learning data-centric methods: a thesis in Data Science
Thesis   Open access

Michael Soricelli
Master of Science (MS), University of Massachusetts Dartmouth
2025
DOI: https://doi.org/10.62791/20484

Abstract

As the capabilities of machine learning models grow, so do their costs in power consumption and training time. In the era of deep learning, models have become larger and larger. While the added complexity of these models leads to greater performance, it also requires more computation to train. According to recent estimates, state-of-the-art transformer-based models require 3-5 months to train while consuming 50 gigawatt-hours of energy. While not all deep learning models are this large, higher-performing models do follow a trend of increasing model size over time. As a consequence, training high-performing models on resource-constrained devices becomes more challenging. To address this problem, one can try to minimize training costs through a data-centric approach: reducing the dataset size to limit the amount of data needed for training, while maintaining performance. Two data-centric approaches are used primarily in this work: dataset pruning and dataset condensation. Dataset pruning removes less efficient training samples from a dataset, while dataset condensation aims to condense the knowledge from a large dataset into a smaller, synthetic dataset while preserving the performance of models trained on it. In this work, we propose a couple of methods that utilize both of these ideas to minimize training costs when training a deep neural network. We utilize vector quantization and normalizing flow models to perform variations of distribution learning. Distribution learning is used in this work to guide the training process towards more data-efficient samples and to improve upon existing state-of-the-art dataset pruning methods.
Our results show significantly reduced training costs across multiple benchmark image classification datasets, while maintaining classification performance and outperforming other state-of-the-art dataset pruning methods.
CC BY-NC-ND V4.0 Open Access

