Abstract
As the capabilities of machine learning models grow, so do their costs in power consumption and training time. In the era of deep learning, models have grown steadily larger. While this added complexity yields better performance, it also requires more computation to train. According to recent estimates, state-of-the-art transformer-based models require 3-5 months to train and consume 50 gigawatt-hours of energy. Although not all deep learning models are this large, higher-performing models have trended toward larger sizes over time. Consequently, training high-performing models on resource-constrained devices is becoming more challenging. One way to address this problem is a data-centric approach: reduce the size of the training dataset so that training costs fall while model performance is maintained. This work primarily uses two data-centric approaches, dataset pruning and dataset condensation. Dataset pruning removes less efficient training samples from a dataset, while dataset condensation condenses the knowledge in a large dataset into a smaller, synthetic dataset that preserves the performance of models trained on it. In this work, we propose methods that combine both ideas to minimize the cost of training a deep neural network. We use vector quantization and normalizing flow models to perform variations of distribution learning, which guides the training process toward more data-efficient samples and improves upon existing state-of-the-art dataset pruning methods.
Our results show significantly reduced training costs across multiple benchmark image classification datasets while maintaining classification performance and outperforming other state-of-the-art dataset pruning methods.
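To make the general idea of dataset pruning concrete, the following minimal sketch keeps only the highest-scoring fraction of a training set. It is an illustration under stated assumptions, not the method proposed in this work: the function name `prune_by_score`, the `keep_fraction` parameter, and the per-sample scores are hypothetical, and how the scores are computed (for example, by a distribution-learning model) is left unspecified.

```python
def prune_by_score(scores, keep_fraction):
    """Return the sorted indices of the highest-scoring samples.

    scores: one number per training sample; higher is assumed to mean
        "more useful for training" (hypothetical scoring, not from the paper).
    keep_fraction: fraction of the dataset to retain, in (0, 1].
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Keep at least one sample when the dataset is non-empty.
    n_keep = max(1, round(keep_fraction * len(scores)))
    # Rank sample indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original dataset order among the retained samples.
    return sorted(ranked[:n_keep])


# Usage: retain half of a toy four-sample dataset.
samples = ["img_a", "img_b", "img_c", "img_d"]  # placeholder sample names
toy_scores = [0.1, 0.9, 0.5, 0.7]               # made-up scores
kept = [samples[i] for i in prune_by_score(toy_scores, 0.5)]
```

Training would then proceed on `kept` rather than on the full dataset, which is where the reduction in training cost would come from.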