Classifying the MapReduce Data Analytics Applications (Part I)

This is the first part of a series of posts analyzing the performance characteristics of common MapReduce data analytics applications. Most of the applications are chosen from the Apache Mahout project and the PUMA benchmark suite developed at Purdue University (http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1438&context=ecetr).

Applications listed by function

  1. Clustering Algorithms
    • KMeans, Fuzzy KMeans, Canopy clustering, LDA (Latent Dirichlet Allocation)
    • Features:
      1. Memory Intensive: The applications need a large amount of memory to hold the parameters. In the case of clustering algorithms, the parameters are the cluster centroids. The size of the parameters grows with the total number of clusters the algorithm is trying to generate. An analysis of how this class of clustering algorithms is implemented in the Apache Mahout package can be found at https://yunmingzhang.wordpress.com/2014/01/13/apache-mahout-kmeans-implementation/
      2. Compute Intensive: A lot of computation goes into calculating the similarity of two vectors (the distance measure calculation).
  2. Classification Algorithms
    • K Nearest Neighbor (KNN). Its structure is similar to a join.
    • Features
      1. Memory Intensive: It stores an in-memory parameter. In the case of KNN, it loads the smaller set of input vectors into memory and streams through the training data.
      2. Compute Intensive: Again, the applications calculate the similarity of two vectors by computing vector products.
  3. Dataset Join Algorithm
    • Hash Join
    • Features
      1. Memory Intensive: It loads the smaller lookup table into memory. Even though the table is relatively “small”, the in-memory data structure can still take a lot of memory.
      2. IO Intensive: The in-memory table only needs to be loaded into memory once. Since the map task simply queries the hash table for each input record, the computation is not intensive, which leaves reading the input as the main cost.
  4. Sort data (IO intensive)
    • TeraSort
  5. Term Vector (IO intensive): determines the most frequent words in a host and is useful in analyses of a host’s relevance to a search.
  6. Inverted-index: takes a list of documents as input and generates word-to-document indexing.
  7. Histogram (IO intensive): generates a histogram of input data and is a generic tool used in many data analyses.
  8. Grep (IO intensive): finds occurrences of a given word in the input.
  9. Word Count (IO intensive): counts the occurrences of every word in the input.
  10. Ranked Inverted Index (IO intensive)
  11. Other benchmarks (Compute Intensive)
    • Pi
    • n-body
    • Black Scholes
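
To make the clustering profile in item 1 concrete, here is a minimal single-process sketch of one KMeans iteration written in map/reduce style. It is not Mahout's implementation; the function names (`euclidean`, `kmeans_map`, `kmeans_reduce`) and the toy data are assumptions made for illustration. The `centroids` list plays the role of the in-memory parameters, and the distance call inside the map phase is where the compute time goes.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Distance measure between two vectors (the compute-heavy inner loop)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_map(points, centroids):
    """Map phase: emit (index of nearest centroid, point) for every input point.

    `centroids` is held in memory for the whole task; its size grows with
    the number of clusters, and every point is compared against all of it.
    """
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: euclidean(p, centroids[i]))
        yield nearest, p


def kmeans_reduce(assignments, dim):
    """Reduce phase: average the points assigned to each centroid."""
    sums = defaultdict(lambda: [0.0] * dim)
    counts = defaultdict(int)
    for idx, p in assignments:
        counts[idx] += 1
        for d in range(dim):
            sums[idx][d] += p[d]
    return {idx: [s / counts[idx] for s in sums[idx]] for idx in sums}


# Hypothetical toy input: four 2-D points and two starting centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_reduce(kmeans_map(points, centroids), dim=2)
print(new_centroids)
```

Each point costs one distance computation per centroid, so both the memory footprint and the work per record scale with the number of clusters.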
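
The hash join profile in item 3 can be sketched in the same way. This is a simplified, single-process illustration, not code from PUMA or Hadoop; `build_lookup`, `hash_join_map`, and the sample tables are hypothetical names and data. The smaller table is loaded into a dictionary once, and the larger input is then streamed through with one hash probe per record.

```python
def build_lookup(small_table):
    """Load the smaller table into an in-memory hash table (done once per map task)."""
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)
    return lookup


def hash_join_map(large_records, lookup):
    """Stream the larger input; each record costs a single hash probe."""
    for key, value in large_records:
        for match in lookup.get(key, []):
            yield key, value, match


# Hypothetical toy input: a small "users" table and a larger "clicks" stream.
users = [(1, "alice"), (2, "bob")]
clicks = [(1, "/home"), (2, "/cart"), (1, "/search"), (3, "/404")]
joined = list(hash_join_map(clicks, build_lookup(users)))
print(joined)
```

The per-record work is a dictionary lookup, so the running time is governed by how fast the larger input can be read, while the dictionary itself accounts for the memory use.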