Resource efficient distributed computing: a dissertation in Electrical Engineering

Yingjie Wang

doi:10.62791/20455

Dissertation

Open access

Doctor of Philosophy (PHD), University of Massachusetts Dartmouth

2025

Abstract

There is a surge of interest in distributed computing, thanks to advances in clustered computing and big data technology. My research explores topics in machine learning and big data technologies related to learning under decentralized resources. One topic in distributed learning is distributing large-scale centralized computation across clustered or multi-core computers. We propose random projection forests (rpForests), a method for fast k-nearest neighbor (kNN) search. RpForests finds nearest neighbors by combining multiple kNN-sensitive trees, each constructed recursively through a series of random projections. As a tree-based methodology, rpForests has very low computational complexity, and it achieves remarkable accuracy, with a fast-decaying kNN missing rate and a fast-decaying discrepancy in the k-th nearest neighbor distances, as demonstrated on many datasets. The ensemble nature of rpForests makes it easy to parallelize on clustered or multi-core computers; the running time is shown to be nearly inversely proportional to the number of cores or machines. Two further topics treat the data in machine learning as a computing resource. Existing learning algorithms typically assume that all the data reside in one centralized place, whereas increasingly the data are located at a number of distributed sites, and we wish to learn over data from all the sites with low communication overhead. In addition, the data of interest often have features shared by other datasets from multiple sources, and it is desirable to take advantage of such auxiliary datasets. We propose two approaches under this topic: fast communication-efficient spectral clustering over distributed data, and fuzzy join of data with shared features. For spectral clustering, we propose a novel framework that enables computation over data from all the physical nodes with minimal communication overhead and a major speedup in computation.
The loss in accuracy is negligible compared to the non-distributed setting. The proposed approach allows local parallel computing where the data are located, and the speedup is most substantial when the data are evenly distributed across sites. Experiments show almost no loss in accuracy with our approach and a 2x speedup under various settings with two distributed sites. Because the transmitted data need not be in their original form, the framework readily addresses privacy concerns about data sharing in distributed computing. We also propose an efficient algorithm, fuzzy join, that enhances learning from the given data by leveraging auxiliary data through shared features. Fuzzy join enables the extraction of additional information along the dimensions implied by features that are in the auxiliary data but not in the given data. Our implementation, based on random projection forests, is efficient, with log-linear computational complexity, and is resistant to noise in the data. Experiments demonstrate the practicality of our approach. Fuzzy join extends the scope of the join operation in relational databases by performing joins on non-index-key columns and allowing non-exact matches between rows from different datasets.
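
As a rough illustration of the random projection forest idea summarized in the abstract, the sketch below builds several trees by recursively splitting the data along random directions, pools the points that share a leaf with each query point as kNN candidates, and ranks those candidates by exact distance. It is not the dissertation's implementation: the median split rule, the brute-force fallback, and all names (`build_rp_tree`, `rp_forest_knn`, `n_trees`, `leaf_size`) are assumptions made for this example.

```python
import numpy as np


def build_rp_tree(X, idx, leaf_size, rng):
    """Recursively split the points in idx along random directions.

    Returns a list of leaves, each an array of row indices into X.
    The median split is an assumption of this sketch.
    """
    if len(idx) <= leaf_size:
        return [idx]
    direction = rng.standard_normal(X.shape[1])  # random projection direction
    proj = X[idx] @ direction
    split = np.median(proj)
    left, right = idx[proj <= split], idx[proj > split]
    if len(left) == 0 or len(right) == 0:  # degenerate split, e.g. duplicate points
        return [idx]
    return (build_rp_tree(X, left, leaf_size, rng)
            + build_rp_tree(X, right, leaf_size, rng))


def rp_forest_knn(X, k, n_trees=10, leaf_size=30, seed=0):
    """Approximate kNN for every row of X using a forest of random projection trees."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    candidates = [set() for _ in range(n)]
    # Each tree is independent of the others, so this loop could be
    # distributed across cores or machines.
    for _ in range(n_trees):
        for leaf in build_rp_tree(X, np.arange(n), leaf_size, rng):
            members = leaf.tolist()
            for i in members:
                candidates[i].update(members)
    neighbors = np.empty((n, k), dtype=int)
    for i in range(n):
        cand = np.array(sorted(candidates[i] - {i}), dtype=int)
        if len(cand) < k:  # too few candidates: fall back to all other points
            cand = np.delete(np.arange(n), i)
        dists = np.linalg.norm(X[cand] - X[i], axis=1)
        neighbors[i] = cand[np.argsort(dists)[:k]]
    return neighbors


if __name__ == "__main__":
    data = np.random.default_rng(1).standard_normal((300, 5))
    print(rp_forest_knn(data, k=5, n_trees=20)[:3])
```

Adding trees enlarges each point's candidate set, which is one way to picture the decaying kNN missing rate mentioned in the abstract; the cost per tree stays low because each point is only compared against the members of its own leaves.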

Files and links (1)

pdf

Wang Y. COE PhD Dissertation 2025 (3.59 MB)

CC BY-NC-ND V4.0, Open Access

Details

Title: Resource efficient distributed computing
Creators: Yingjie Wang
ORCID: 0000-0003-0327-0084
Contributors: Honggang Wang (Advisor) - University of Massachusetts Dartmouth, Department of Electrical and Computer Engineering
Donghui Yan (Advisor) - University of Massachusetts Dartmouth, Department of Mathematics
Liudong Xing (Committee Member) - University of Massachusetts Dartmouth, Department of Electrical and Computer Engineering
Ping Chen (Committee Member) - University of Massachusetts Boston
Number of pages: ix, 75 pages
Illustrations: illustrations (some color)
Table of contents: List of figures -- List of tables -- Introduction -- K-nearest neighbor search by random projection forests -- Spectral clustering over distributed data -- Fuzzy join of data with shared features -- Design and analysis of the algorithms -- A framework for spectral clustering on distributed data -- The design of rpForests -- Fuzzy join of data with shared features -- Related work -- Fast communication-efficient spectral clustering over distributed data -- K-nearest neighbor search by random projection forests -- Fuzzy join of data with shared features -- Experiments -- Fast communication-efficient spectral clustering over distributed data -- K-nearest neighbor search by random projection forests -- Fuzzy join of data with shared features -- Future work -- Conclusions -- References.
References: Includes bibliographical references (pages 63-73).
Awarding Institution: University of Massachusetts Dartmouth
Degree Awarded: Doctor of Philosophy (PHD)
Degree in: Electrical Engineering
Academic Unit: Department of Electrical and Computer Engineering
Language: English
Resource Type: Dissertation
DOI: https://doi.org/10.62791/20455
Record Identifier: 9914443624201301