Meeting Notes with Prof. John Mellor-Crummey

This is a summary of my meeting with Prof. John Mellor-Crummey on Monday Feb 3rd.  In the 1.5 hr meeting, I talked about the motivation, problem statement, implementation and results of my work on improving Hadoop MapReduces’ performance on multi-core systems. Prof. Mellor-Crummey provided a lot of helpful feedback on candidate JVM analyzation tools, components of the master thesis and future plans for the research project.  The summary is divided into the following components

  1. Components in my research that is of interest to Prof. Mellor-Crummey
  2. Possible tools for analyzing the heap utilization in JVM
  3. Making the parallel mapper, JVM more transparent
  4. Next steps in the research project and components of the Master Thesis

  • Components in my research that is of interest to Prof. Mellor-Crummey

Prof. Mellor-Crummey is interested in the research for two main reasons. The first is that they are working on using HPC Toolkit for performance analysis on Hadoop MapReduce runtime system. Understanding how Hadoop MapReduce works can be helpful. Secondly, the clustering algorithms, such as KMeans, could be used for clustering data generated by HPC tool kit.

  • Possible tools for analyzing the heap utilization in JVM

The first tool Prof. Mellor-Crummey suggested is SWAT from AMD. It appears that HadoopCL made by Max might have used the tool at one point.

The second tool is Visualvm, which looks like a popular JVM profiling tool. I plan to try it out today. I tlooks very promising.

Additional tools that people has recommended so far, “jmap”.

  • Making the parallel mapper, JVM more transparent

Currently the parallel mapper requires the user to lock any data structure that they are accumulating the results of map() into. For example, in the case of KMeans, a newClusterCentorids object. This is because there could be a write-write data race if multiple map() calls updates the data structure. It would be nice if we don’t have to impose the requirement for users to add a locking mechanism. 

Similarly, for the parallel JVM approach, the changes are more transparent. But we still need some locking during the initialization of the read-only data structure to make sure that only one thread initializes the data structure and no other thread uses it before it is being initialized. 

  • Next steps in the research project and components of the Master Thesis

Prof. Mellor-Crummey suggested a sequence for chaining the components of the project together. We start from analyzing / profiling the performance characteristics of popular MapReduce applications. Some of them are memory intensive, as in HashJoin, KNN Join, KMeans and other clustering algorithms. Next we move on to analyze why this is an issue (duplication of in-memory data structures across different task JVMs on the same machine).

Immediately, we introduce the first solution to the problem, the parallel Mappers with dynamic chunking. This reduces the number of mapper needed for a multicore machine to 1-2, increasing the available memory and reducing duplications. However, this only works for compute intensive applications such as KMeans. For more IO intensive application, such as HashJoin, the mapper approach is useless because not enough key,val pairs can be read in time.

As a result, we implemented a second approach to this problem. Instead of subdividing a single map task into smaller sub tasks that can be executed, I modified the Hadoop source code to have multiple map tasks execute in the same JVM. And if the user declares the data structure as static, then there will be only one copy in the memory of the JVM, reducing duplications. The best part of this approach is that it parallelizes reading fileInputSplit and deserialization of the input key value pairs. This approach enables executing applications that are less compute intensive and more IO bounded, such as hash join. 


Combining the two approaches, we can find that most memory intensive applications can use one approach or the other, or a hybrid of the two (parallel JVM with two or more parallel mappers. This way, there can be more overlap between IO and computation). The current work is aiming at finding the right configuration (par JVM + number of parallel mapper) for the applications we have, KMeans, KNN, Hashjoin. 

The future work, but also the interesting part is an algorithm that could potentially self adjust the configuration of the Hadoop MapRedcue system to achieve the best utilization of IO and Computation resources in multi-core nodes in the cluster. 


This entry was posted in Hadoop Research, HJ-Hadoop Improvements. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s