Abstract
As machine learning methods for image classification advance, models are growing larger and the need for large datasets is increasing. Although larger models and datasets improve classification accuracy, they also increase the computational cost and energy consumption of training the classification model on edge devices. Data-centric approaches have been proposed to address this problem. One family of data-centric approaches prunes datasets to minimize the size of the training data while maintaining model performance. Current state-of-the-art dataset pruning methods based on prediction uncertainty are effective, but they struggle when the target dataset contains noisy images. In this work, we propose a novel dataset pruning method that adapts prior work, which uses data uncertainty computed over the course of training to prune easy-to-learn data points from a large-scale dataset, and adds density estimation with normalizing flow models to encourage the pruning of outlier data. Density estimation allows our method to perform uncertainty-based pruning while also removing noisy images from the target dataset. Experimental results with the weighted score-based approach show accuracy improvements of up to 1.11% with moderate noise injection and 4.00% with high levels of noise injection.