Abstract
Indexes are fundamental components of database management systems, traditionally implemented through structures such as B-Trees, hash tables, and bitmap indexes. These structures map keys to data records, optimizing search efficiency within databases. Recent advances in machine learning have introduced the concept of learned indexes, in which models such as neural networks predict the position or existence of records based on the learned data distribution. This exploratory research posits that traditional index structures can be replaced with learned models, potentially offering significant performance improvements. Initial findings indicate that neural network-based indexes can outperform cache-optimized B-Trees in lookup speed while reducing memory usage across a range of real-world datasets.

The growing frequency of data breaches and the need for strong privacy safeguards are equally crucial considerations in database management. A notable example is the 2013 Yahoo data breach, widely regarded as one of the most significant in history. Attackers exploited a vulnerability in Yahoo's cookie infrastructure to gain unauthorized access to personal information, including the names, birthdates, email addresses, and passwords of the entire user base of 3 billion Yahoo accounts. The full magnitude of the breach was disclosed in 2016, while Verizon was in the process of acquiring the company, and reduced Verizon's proposed offer by USD 350 million. This incident highlights the pressing need for stronger data security and privacy in database management. Concurrently, the discipline of machine learning faces the challenge of balancing the extraction of useful information from data with the protection of confidentiality. Differential privacy has emerged as a rigorous framework for safeguarding the privacy of individual records while still permitting meaningful data analysis.
This thesis investigates the incorporation of differential privacy into learned index structures, examining privacy-preserving machine learning algorithms and learning-based data release mechanisms. We study the theoretical limits of differential privacy in the context of machine learning, including upper bounds on the loss functions of differentially private algorithms. The intersection of learned indexes and differential privacy presents unique challenges and opportunities. This research addresses key issues such as the incorporation of public data, the handling of missing data in private datasets, and the impact of differential privacy on the utility of machine learning algorithms as data volume increases. Our work aims to demonstrate that differentially private learned indexes can achieve utility comparable to their non-private counterparts while ensuring robust privacy protections. This thesis provides a comprehensive overview of the potential for integrating learned indexes with differential privacy, paving the way for more secure and efficient data management systems.