This is a post introducing KMeans Clustering Algorithm and explaining the MapReduce implementation of the algorithms.
KMeans is a classical clustering algorithm. Clustering algorithms classify a given data set into a number of clusters. Let’s put the algorithm in the context of an application. Given all the news that come out today, the algorithm group news that belong to the same topic, such as sports, politics and movies.
The idea behind KMeans clustering is to
- First define k centroids (can be randomly chosen, but the centroids should be far away from each other if possible).
- The next step is to take each point belonging to a given data set and associate it to the nearest centroid.
- Now that we have all the points assigned to one of the k clusters, we recalculate the centroids of the k-clusters.
- Go back to step 2 with the updated centroids location. Repeat the step 2 and 3 until the centroids no longer move.
To implement the algorithm in MapReduce fashion,
In Map phase, each map task read in a shared file containing all the cluster centroids. The cluster centroids file can be typically stored in a Hadoop distributed cache. Each map task read in a small slice of input data points. For each data point, it assign it to the cluster whose centroid is closest to the current data.
An optimization that can be done is we create a set of k centroids that store the new coordinates. As a result, data points in the same map task are accumulated in the new cluster centroids locally. This reduces the total number of output key,val pairs of the map task from NumberOfDataPoints to K (number of clusters). A reduction in the number of output key,val pairs will lead to a significantly shorter shuffle time.
In the reduce phase, the centroid coordinates outputted from each Map Task will be further reduced.
Multiple MapReduce jobs are chained together to simulate the multiple iterations of the algorithm.
- Prof. Chris Jermaine’s summer Big Data Insititute Program