The first part of this post introduces the k-Nearest Neighbor (k-NN) algorithm, a popular classification algorithm in data analytics. The second part describes its MapReduce implementation.
The paragraph below is quoted from Wikipedia:
- In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
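To make the majority-vote definition concrete, here is a minimal sketch of k-NN classification in Python (the function name `knn_classify` and the toy data are illustrative, not from any particular library):

```python
from collections import Counter

def knn_classify(query, labeled_points, k):
    """Classify `query` by majority vote among its k nearest labeled points.
    `labeled_points` is a list of ((x, y), label) tuples; distance is Euclidean."""
    neighbors = sorted(
        labeled_points,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

data = [((0, 0), "A"), ((1, 0), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify((0.5, 0.2), data, 3))  # → A

# With k = 1 the query simply takes the label of its single nearest neighbor:
print(knn_classify((5.5, 5.0), data, 1))  # → B
```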
K Nearest Neighbor (KNN) Join takes two data sets R and S as input. It compares every point in R against every point in S and outputs results based on all pairwise comparisons, which leads to a complexity of O(|R| · |S|). KNN Join is representative of many large-scale data analytics applications that examine interactions between different large data sets, such as in-memory hash joins, spatial range joins, and similarity-based search in databases. For example, KNN Join is similar to the Fragment Replicated Join in Pig and the Map-Side Join in Hive.
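A single-machine version of this pairwise comparison makes the O(|R| · |S|) cost visible: every r in R is compared against every s in S. This is an illustrative sketch, not code from the post:

```python
import heapq

def knn_join(R, S, k):
    """Naive KNN join: for each point r in R, find its k nearest points in S.
    Every (r, s) pair is compared, so the cost is O(|R| * |S|)."""
    def dist2(a, b):
        # Squared Euclidean distance; ranking is unaffected by skipping sqrt.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return {r: heapq.nsmallest(k, S, key=lambda s: dist2(r, s)) for r in R}

R = [(0, 0), (10, 10)]
S = [(1, 1), (2, 2), (9, 9), (8, 8)]
print(knn_join(R, S, 2))
# → {(0, 0): [(1, 1), (2, 2)], (10, 10): [(9, 9), (8, 8)]}
```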
Since the two data sets are too large to fit on one node, the MapReduce KNN Join application has to divide up the data sets and process them across machines in parallel. In practice, one data set S is often much larger than the other data set R. The current MapReduce implementation of KNN Join therefore broadcasts the smaller R to every map task and splits the larger S, streaming each datum through the mappers.
In the map stage, the application loads the smaller data set R into the memory of each map task. The Hadoop MapReduce runtime then uses the larger data set S as input to the job, splitting S into small pieces that are fed to the map tasks. The map function computes the distance between each data point of S and every data point of R. The memory footprint of the mapper JVM is large because the data structure holding R occupies a lot of memory, and R cannot be garbage collected since every (key, value) pair of S must be compared against it. In the reduce stage, the application iterates over each data point in R and selects the K closest data points in S. This optimized algorithm spends very little time in the reduce phase.
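The broadcast-R scheme above can be sketched as a small in-process simulation of the map and reduce stages. The `mapper` and `reducer` functions and the hand-rolled shuffle are illustrative stand-ins for the Hadoop machinery, not actual Hadoop APIs:

```python
import heapq
from collections import defaultdict

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mapper(R, s_split):
    """Each map task holds all of R in memory and streams one split of S,
    emitting (r, (distance, s)) pairs for every comparison."""
    for s in s_split:
        for r in R:
            yield r, (dist2(r, s), s)

def reducer(r, dist_s_pairs, k):
    """For one point r in R, keep only the k closest points of S."""
    return [s for _, s in heapq.nsmallest(k, dist_s_pairs)]

R = [(0, 0), (10, 10)]
S_splits = [[(1, 1), (9, 9)], [(2, 2), (8, 8)]]  # S divided into input splits

# Simulated shuffle: group mapper output by key (a point of R).
shuffled = defaultdict(list)
for split in S_splits:               # each split would run in its own map task
    for key, value in mapper(R, split):
        shuffled[key].append(value)

result = {r: reducer(r, pairs, k=2) for r, pairs in shuffled.items()}
print(result)
# → {(0, 0): [(1, 1), (2, 2)], (10, 10): [(9, 9), (8, 8)]}
```

The reduce step only has to select the top k entries per point of R, which is why so little time is spent in the reduce phase: almost all of the O(|R| · |S|) work happens in the mappers.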