Abstract
Indexes are fundamental components of database management systems, traditionally implemented through structures such as B-Trees, hash tables, and bitmap indexes. These structures map keys to data records, optimizing search efficiency within databases. Recent advances in machine learning have introduced the concept of learned indexes, in which models such as neural networks predict the position or existence of records based on the learned data distribution. This exploratory research posits that traditional index structures can be replaced with learned models, potentially offering significant performance improvements. Initial findings indicate that neural network-based indexes can outperform cache-optimized B-Trees in lookup speed while reducing memory usage across a range of real-world datasets.

The growing frequency of data breaches and the need for strong privacy safeguards are equally crucial considerations in database management. A notable example is the 2013 Yahoo data breach, widely regarded as one of the most significant in history. Attackers exploited a vulnerability in Yahoo's cookie infrastructure to gain unauthorized access to personal information, including the names, birthdates, email addresses, and passwords of the entire user base of 3 billion Yahoo accounts. The full magnitude of the breach was disclosed in 2016, while Verizon was in the process of acquiring the company, and reduced Verizon's proposed offer by USD 350 million. This incident highlights the pressing need for stronger data security and privacy in database management. Concurrently, the discipline of machine learning faces the challenge of balancing the extraction of useful information from data with the protection of confidentiality. Differential privacy has emerged as a rigorous framework for safeguarding the privacy of individual records while still permitting meaningful data analysis.
This thesis investigates the incorporation of differential privacy into learned index structures, examining privacy-preserving machine learning algorithms and learning-based data release mechanisms. We study the theoretical limits of differential privacy in the context of machine learning, including upper bounds on the loss functions of differentially private algorithms. The intersection of learned indexes and differential privacy presents unique challenges and opportunities. This research addresses key issues such as the incorporation of public data, the handling of missing data in private datasets, and the impact of differential privacy on the utility of machine learning algorithms as data volume increases. Our work aims to demonstrate that differentially private learned indexes can achieve utility comparable to their non-private counterparts while ensuring robust privacy protections. This thesis provides a comprehensive overview of the potential for integrating learned indexes with differential privacy, paving the way for more secure and efficient data management systems.