Abstract
There is a surge of interest in distributed computing thanks to advances in clustered computing and big data technology. My research explores topics in machine learning and big data technologies related to learning under decentralized resources. One topic of distributed learning is how to distribute large-scale centralized computation across clustered or multi-core computers. We propose a method for fast computation of kNN search, random projection forests (rpForests). RpForests finds nearest neighbors by combining multiple kNN-sensitive trees, each constructed recursively through a series of random projections. As a tree-based methodology, rpForests has very low computational complexity, and it achieves remarkable accuracy: both the rate of missed kNNs and the discrepancy in the k-th nearest neighbor distances decay quickly, as demonstrated on many datasets. The ensemble nature of rpForests makes it easy to parallelize on clustered or multi-core computers; the running time is shown to be nearly inversely proportional to the number of cores or machines. Two other topics treat the data in machine learning as a computing resource. Existing learning algorithms typically assume that all the data are in one centralized place, whereas increasingly often the data are located at a number of distributed sites, and we wish to learn over data from all the sites with low communication overhead. Also, the data of interest often have features shared by other datasets from multiple sources, and it is desirable to take advantage of such auxiliary datasets. We propose two approaches under this topic: fast communication-efficient spectral clustering over distributed data, and fuzzy join of data with shared features. For spectral clustering, we propose a novel framework that enables computation over data from all the physical nodes with minimal communication overhead and a major speedup in computation.
The loss in accuracy is negligible compared to the non-distributed setting. The proposed approach allows local parallel computing where the data are located, and the speedup is most substantial when the data are evenly distributed across sites. Experiments show almost no loss in accuracy with our approach and a 2x speedup under various settings with two distributed sites. As the transmitted data do not need to be in their original form, the framework readily addresses the privacy concern for data sharing in distributed computing. We propose another efficient algorithm, fuzzy join, which enhances learning from the given data by leveraging auxiliary data through shared features. Fuzzy join enables the extraction of additional information along the dimensions implied by features that are present in the auxiliary data but not in the given data. Our implementation, based on random projection forests, is efficient with log-linear computational complexity and is robust to noise in the data. Experiments demonstrate the practicality of our approach. Fuzzy join extends the scope of the join operation in relational databases by performing joins on columns that are not index keys and by allowing non-exact matches between rows from different datasets.
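
To make the idea of a fuzzy join built on random projection trees concrete, the following is a minimal illustrative sketch, not the implementation evaluated in this work. It makes several assumptions that are not stated in the abstract: each tree node is split at the median of the projected values, a row is matched to its single nearest auxiliary row under Euclidean distance on the shared features, and the names build_rp_tree, leaf_of, fuzzy_join, n_trees, and leaf_size are hypothetical.

```python
import numpy as np

def build_rp_tree(X, idx, leaf_size, rng):
    """Recursively split the rows in idx using random projections."""
    if len(idx) <= leaf_size:
        return {"leaf": idx}
    direction = rng.normal(size=X.shape[1])   # random projection direction
    proj = X[idx] @ direction
    split = np.median(proj)                   # assumption: split at the median
    left, right = idx[proj <= split], idx[proj > split]
    if len(left) == 0 or len(right) == 0:     # degenerate split, e.g. duplicates
        return {"leaf": idx}
    return {"dir": direction, "split": split,
            "left": build_rp_tree(X, left, leaf_size, rng),
            "right": build_rp_tree(X, right, leaf_size, rng)}

def leaf_of(tree, q):
    """Route a query point down one tree and return the row indices in its leaf."""
    while "leaf" not in tree:
        side = "left" if q @ tree["dir"] <= tree["split"] else "right"
        tree = tree[side]
    return tree["leaf"]

def fuzzy_join(given, aux, shared_given, shared_aux,
               n_trees=10, leaf_size=8, seed=0):
    """Append to each row of `given` the non-shared columns of its
    approximately nearest row in `aux`, matched on the shared features."""
    rng = np.random.default_rng(seed)
    A = aux[:, shared_aux]
    all_rows = np.arange(len(A))
    trees = [build_rp_tree(A, all_rows, leaf_size, rng) for _ in range(n_trees)]
    extra_cols = [j for j in range(aux.shape[1]) if j not in shared_aux]
    matches = np.empty(len(given), dtype=int)
    for i, row in enumerate(given):
        q = row[shared_given]
        # Candidates are the union of the leaves reached in every tree.
        cand = np.unique(np.concatenate([leaf_of(t, q) for t in trees]))
        dist = np.linalg.norm(A[cand] - q, axis=1)
        matches[i] = cand[np.argmin(dist)]
    return np.hstack([given, aux[matches][:, extra_cols]]), matches

# Small demonstration on synthetic data (all values are made up).
rng = np.random.default_rng(1)
aux = rng.normal(size=(64, 5))                # columns 0-2 shared, 3-4 extra
rows = np.array([3, 10, 42])
noisy_shared = aux[rows][:, :3] + 1e-3 * rng.normal(size=(3, 3))
given = np.hstack([noisy_shared, np.ones((3, 1))])
joined, matches = fuzzy_join(given, aux, [0, 1, 2], [0, 1, 2])
```

Because each query is compared only against the points that share a leaf with it in some tree, the search avoids a full scan of the auxiliary data, and the trees are independent of one another, so they could be built and queried in parallel.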