Efficient Implementation of Coreset-based K-Means Methods
Loading...
URL
Journal Title
Journal ISSN
Volume Title
Perustieteiden korkeakoulu |
Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for Your own personal use. Commercial use is prohibited.
Authors
Date
2021-06-14
Department
Major/Subject
Xiaobo Wu
Mcode
SCI3044
Degree programme
Master’s Programme in Computer, Communication and Information Sciences
Language
en
Pages
53
Series
Abstract
The problem of efficiently extracting information from big data has received significant attention in both academia and industry. In the last decades, many computational models like distributed/parallel computing, streaming algorithms and dynamic algorithms have been proposed to cope with big data. These methods were commonly used nowadays and have obtained great performances in many scenarios related to dealing with a massive amount of data. However, they are algorithm-driven approaches which means you need to generate new ideas every time, and it still will be constrained by the growing data size. Recently, a data-driven approach named coreset was developed to generate a much smaller summary to the original big data and directly apply the existing algorithms in the small dataset to reach an acceptable result in a very decent time and space complexity. In the latest research, the size of coreset $|S|$ can be handle to uncorrelated with the original data size $|D|$, which means unlimited by the size of data. My main result is an efficient implementation of an open-source coreset-based modern computational framework based on C++ and CUDA programming language in this thesis project. Specifically, we implement a coreset library that can construct a small coreset for arbitrary shapes of numerical data with a decent time cost. My job was mainly to follow the research proposed by Braverman et al. (SODA 2021). The experiments were conducted to benchmark with the CuML library, a well-known collection of the efficient GPU implementation of various machine learning algorithms. The benchmark results prove our implementation of the coreset method can lead to a faster speed in solving the k-means problem among the big data with high accuracy in the meantime. In addition, by utilizing the composable property of the coreset, our merge-and-reduce version of the coreset breaks through the restriction of GPU memory.Description
Supervisor
Jiang, ShaofengThesis advisor
Jiang, ShaofengKeywords
coreset, k-means, cuda, cuml, c++, machine learning