15618 Project  parallel kmeans clustering
Team member: Robert Truong <hoangpht>
Project website
Project website is https://hoangphuoc25.github.io/15618project. The source code is on github
Summary
I will implement parallelized versions of the kmeans clustering algorithm on different targets (CPU/GPU) using different tools (OpenMP/ISPC/MPI/CUDA) and perform thorough analysis.
Background
Kmeans clustering is an important unsupervised machine learning method that is being used widely in data mining to group data points together and potentially find something in common among the data in the same group. The algorithm attempts to group data points using a loop with 2 steps:
 Assign a label to each data point based on the centroid nearest to it
 Calculate new centroids of each label using the points assigned in previous step
In practice, the number of data points can be very large and thus, step 1 will take a significant amount of time. However, it can be parallelized as there is no dependency between computation for each data point. In this project, I will attempt to parallelize the algorithm using OpenMP, MPI and CUDA.
Challenges
Naive kmeans doesn’t guarantee convergence to global optimum, often depending on the initialization of centroids. There are variations of the naive algorithm that makes it easier to choose centroids (e.g., kmeans++) and allows the algorithm to converge faster, and I’m planning to implement it as part of the optimization.
Another challenge is handling load imbalance. We have to handle all cases and aim to achieve a reasonable speedup in the case of heavily skewed data or lots of outliers. I plan to use different techniques/tools learned in class to reduce the communication and handle load better.
Specifically, the task of assigning a new label for each point in each iteration is more computationally expensive than the task of computing new centroid positions. There’s a few ways to parallelize it:
 Parallelize the loop with OpenMP or CUDA: for each data point, loop through all centroids and find the nearest centroid.
 (MPI/CUDA) Each process is responsible for one centroid. At the beginning of each iteration, each process communicates its centroid with other processes and calculates new boundaries, then uses new boundaries to calculate among the current points assigned to it, which points lie outside its region and send those to the correct process.
Resources
Hardware:
 I will use GHC machines to test the implementations using CUDA/OpenMP, and possibly PSC machines if time permits an MPI implementation
Software:
 I will implement the algorithm using C++ and OpenMP/CUDA/MPI
Test data:
 I will use publicly available data, e.g. this kaggle competition, to validate the correctness of the sequential implementation and use it as the baseline to measure speedup of parallel implementations.
Goals and deliverables

75% goal: minimum goal is to have a correct sequential implementation and complete parallel implementation with OpenMP and/or CUDA.

100% goal: I hope to achieve significant speedup compared to the sequential implementation, as well as a thorough analysis and comparison between the two implementations.

125% goal: if time permits and if I can find a large enough dataset, I hope to implement an MPI version and compare it against other parallel implementations.
Schedule
11/911/13: Finalize project topic.
11/1411/20: Research available literature, finish sequential implementation, generate large test cases for performance comparison.
11/2111/27: Complete first parallel implementation (OpenMP), and start optimization.
11/2812/3: Finish checkpoint, start second parallel implementation.
12/412/10: Finish second parallel implementation, start optimization and perform thorough analysis
12/1112/18: Finish report and demo
Updates:
 I will parallelize kmeans clustering algorithm for the project.
 The dataset used for performance benchmarking will be generated instead of collecting from public sources.
Project midterm report
The midterm report is available here