This is a summary of two separate meetings with Prof. Alan Cox and Prof. Vivek Sarkar, held to get feedback on the latest implementation of a multi-core optimized Hadoop MapReduce runtime.
The meetings covered the following topics:
- Implementation Strategy for the multicore optimized runtime.
- Possible ways to further improve the performance of the runtime.
- Classification of two approaches to multi-core parallelism: task parallelism and intra-task parallelism.
- Algorithm for choosing between the two approaches and their respective strengths and weaknesses.
- Classification of MapReduce data analytics applications based on their memory footprint, computation intensiveness, etc.
- Benchmark applications and finding more of them
1. Implementation Strategy for the multicore optimized runtime
The implementation consists of two main parts. The first part is changing source code in the org.apache.hadoop.mapred package directly, specifically Child.java and JvmManager.java. We changed Child.java into a compute-server style implementation and modified JvmManager.java to act almost like a client.
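The compute-server structure can be sketched roughly as follows. This is an illustrative skeleton only, not the actual Child.java changes; the class and method names are hypothetical. The idea is a long-lived JVM that keeps a pool of worker threads and accepts task submissions from the manager side, rather than a fresh JVM being launched per task.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative skeleton only -- not the real Child.java changes. A long-lived
// "compute server" JVM executes submitted tasks on a shared thread pool,
// instead of a fresh JVM being launched per task.
public class TaskServer {
    private final ExecutorService workers;

    public TaskServer(int cores) {
        this.workers = Executors.newFixedThreadPool(cores);
    }

    // The manager side (JvmManager acting as a client) would submit tasks here.
    public Future<?> submit(Runnable task) {
        return workers.submit(task);
    }

    public void shutdown() {
        workers.shutdown();
        try {
            workers.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        TaskServer server = new TaskServer(Runtime.getRuntime().availableProcessors());
        Future<?> f = server.submit(() -> System.out.println("map task running in shared JVM"));
        f.get(); // wait for the task to finish
        server.shutdown();
    }
}
```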
2. Possible ways to further improve the performance of the runtime
Currently, the runtime is 10-14% slower when executing compute-intensive tasks using only task parallelism (no intra-task parallelism). We considered possible reasons for the slowdown, including:
- Frequent thread migrations; it might be better to simply pin the threads to certain processors
- Increased garbage collection activity
- Log-syncing conflicts (HDFS is not multithreaded)
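To check the garbage-collection hypothesis, one option is to sample the JVM's collector statistics around a batch of tasks. A minimal sketch using the standard java.lang.management API (the class name here is made up for illustration):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hedged sketch: sample cumulative GC time before and after a task batch
// to see whether GC activity explains part of the slowdown.
public class GcProbe {
    public static long totalGcMillis() {
        long ms = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector reports no time
            if (t > 0) ms += t;
        }
        return ms;
    }

    public static void main(String[] args) {
        long before = totalGcMillis();
        // ... run a batch of map tasks here ...
        long after = totalGcMillis();
        System.out.println("GC time during batch (ms): " + (after - before));
    }
}
```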
3. Classification of two approaches to multi-core parallelism, task parallelism and intra-task parallelism
Approach 1: Task level parallelism
This is the parallelism designed into the original Hadoop MapReduce runtime. A job is decomposed into a large number of map tasks and reduce tasks, which execute in parallel with each other.
In the original Hadoop MapReduce runtime, each task runs in its own JVM. As a result, there are usually at least 4-8 map JVMs running on a machine with 4-8 cores. In my optimized runtime, however, multiple map tasks can run in the same JVM, so there is always just one JVM on each machine responsible for executing the tasks. This optimization allows in-memory data structures to be shared across different map tasks.
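Sharing becomes straightforward once tasks are co-located: a read-only side data structure can be loaded once per JVM instead of once per task. A hedged sketch (the class name and contents are hypothetical, not from the actual runtime):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: because all map tasks now run inside one JVM, a
// read-only side data structure can be loaded once and shared, instead of
// being duplicated in every per-task JVM.
public class SharedSideData {
    private static volatile Map<String, String> cache;

    public static Map<String, String> get() {
        if (cache == null) {
            synchronized (SharedSideData.class) {
                if (cache == null) {
                    Map<String, String> m = new ConcurrentHashMap<>();
                    // In a real job this would parse e.g. a distributed-cache file.
                    m.put("exampleKey", "exampleValue");
                    cache = m;
                }
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        // Every map task in this JVM sees the same instance.
        System.out.println(get() == get());
    }
}
```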
Approach 2: Intra Task Parallelism
This approach was taken in my previous implementation of HJ-Mapper and in the MultithreadedMapper included in the Hadoop distribution. It focuses on compute-intensive applications such as KMeans and KNN, where it is useful for an individual map task to be able to use all the available cores. With a reduced number of map tasks, the optimized runtime can also significantly reduce the memory footprint, since duplication of data structures across JVMs (which do not share memory space) is reduced.
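The idea can be sketched as a single map task fanning its records out to a small thread pool. This is illustrative only; the record type and the per-record work are placeholders for the real map(key, value) body:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch in the spirit of MultithreadedMapper: one map task processes its
// input records on several threads so a single task can use multiple cores.
public class ParallelMapTask {
    public static long run(List<String> records, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong processed = new AtomicLong();
        for (String rec : records) {
            pool.submit(() -> {
                // stand-in for the real per-record work, e.g. a KMeans distance pass
                rec.length();
                processed.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        List<String> recs = new ArrayList<>();
        for (int i = 0; i < 100; i++) recs.add("record-" + i);
        System.out.println("processed: " + run(recs, 4));
    }
}
```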
4. Algorithm for choosing between the two approaches and their respective strengths and weaknesses
We do not yet have a clear picture of the respective strengths of the two approaches.
Approach 1 might have a slight disadvantage in memory footprint because several map tasks run at the same time. However, this should not be significant. The question we need to answer is: is there a significant memory penalty to task parallelism even when the tasks execute in the same JVM? This question is aimed at the KNN and KMeans applications.
Approach 1 has the advantage of better overlapping IO and computation. Multiple tasks can be at different stages of execution, so while one task is focusing on IO, another can be focusing on computation.
Approach 2 could have the upper hand in terms of raw memory footprint because there is only one task running.
Approach 2 would have trouble optimally overlapping computation and IO, because it takes a while for a map task to reach the point where it has enough sub-tasks (sub-tasks within a single map task) to utilize all the cores.
I need to develop an algorithm that can analyze the performance characteristics of MapReduce applications and choose the right approach to parallelism for executing the application.
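As a starting point, the choice could be driven by coarse per-task profile numbers. The sketch below encodes only the memory argument discussed above; the names, inputs, and thresholds are entirely hypothetical, not a finished algorithm:

```java
// Hypothetical decision sketch: if running one task per core would exceed
// the node's memory budget (because per-task state cannot be shared), fall
// back to intra-task parallelism; otherwise prefer task parallelism for its
// IO/compute overlap.
public class ParallelismChooser {
    public enum Approach { TASK_PARALLEL, INTRA_TASK_PARALLEL }

    // perTaskMemoryMB: unshareable memory one task needs;
    // memoryBudgetMB: memory available on the node; cores: cores per node.
    public static Approach choose(double perTaskMemoryMB, double memoryBudgetMB, int cores) {
        if (perTaskMemoryMB * cores > memoryBudgetMB) {
            return Approach.INTRA_TASK_PARALLEL;
        }
        return Approach.TASK_PARALLEL;
    }

    public static void main(String[] args) {
        System.out.println(choose(2000, 4000, 8)); // large unshared state per task
        System.out.println(choose(100, 4000, 8));  // small per-task state
    }
}
```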
5. Classification of MapReduce data analytics applications based on their memory footprint, computation intensiveness, etc.
We need to survey the MapReduce data analytics applications and try to classify them into certain categories.
Previously, we classified the applications into “memory intensive” and “non memory intensive” applications. Hashjoin, KMeans, and KNN all have large read-only in-memory data structures. Within the “memory intensive” applications, we further classify them into “compute intensive” and “non compute intensive” applications.
Right now, I need to focus more on whether a task is IO-intensive or compute-intensive. Is it possible that we create a large number of tasks that saturate the IO but not the CPU? The IO is limited by contention among a large number of read requests.
Additionally, I need to look into the size of per-task memory. If each task has to have its own in-memory data structure that cannot be shared, then it becomes problematic to use task parallelism. (I need to take a look at the Mahout benchmark implementations.)
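The classification grid above could eventually be mechanized from profile data. A toy sketch, where the thresholds and names are invented purely for illustration:

```java
// Toy sketch of the classification grid: first split on memory footprint,
// then split the memory-intensive applications on compute intensity.
// Both thresholds are illustrative placeholders, not measured values.
public class AppClassifier {
    public static String classify(double inMemDataMB, double cpuSecPerInputMB) {
        boolean memoryIntensive = inMemDataMB > 256;       // placeholder threshold
        boolean computeIntensive = cpuSecPerInputMB > 1.0; // placeholder threshold
        if (!memoryIntensive) return "non-memory-intensive";
        return computeIntensive ? "memory+compute-intensive" : "memory-intensive";
    }

    public static void main(String[] args) {
        // e.g. KMeans/KNN-style workloads: large in-memory model, heavy compute
        System.out.println(classify(1024, 5.0));
        // e.g. a simple grep-style workload: small state, IO-bound
        System.out.println(classify(10, 0.1));
    }
}
```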
6. Benchmark applications and finding more of them
The last task is to find more applications and rerun some of the previous tests. Prof. Sarkar suggested that I should have at least 6 applications to show the performance improvement.
I think I need to re-survey all the MapReduce applications out there and see if we can find more useful ones. Maybe talking with Max would be a good idea.