Meeting Notes with Prof. Cox and Prof. Sarkar (End of May, Starting June)

This is a post summarizing the two meetings from the end of May and the start of June.

We discussed the paper deadlines, the to-do items before the deadline, and my plan for the next few months.

ToDo Items

  1. Create the paper repository with the text from the master's thesis
    1. Need to rewrite the introduction a bit more
    2. Need to rewrite the related work
  2. Document the code I wrote, write the README
  3. Run experiments (128 cores)
    1. Get the code ready to run on EC2
      1. 8 nodes with 16 cores
      2. 4 nodes with 32 cores
      3. use my own credit card first
    2. STIC, larger scale tests, 16 GB memory
      1. 16 nodes on STIC
      2. if the 16-node configuration is not available, fall back to 8 nodes with 64 cores
      3. trim the data points so that the runs take less time
  4. Lowest-priority future work items
    1. Eliminate the impact of local data structures that take up a lot of memory (the local results accumulator)
      1. The applications can be classified into three categories, and we are dealing with the last category
        1. IO bound
        2. CPU bound
        3. Accumulates large memory per map task (should minimize the number of map tasks)
      2. Implementation ideas
        1. Use a reference count to track whether the current map task is the last one
        2. Use Habanero Java Accumulators
        3. Design it so that the changes can be transparent
    2. Make Compute Server more transparent
      1. explore implementing synchronized setup methods in the library
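
The reference-count idea above could be sketched roughly as follows. This is only an illustration, not the actual implementation: `MapTaskTracker` and its methods are hypothetical names, not part of Hadoop or Habanero Java. The point is that the map task that brings the counter to zero knows it is the last one and can flush the shared in-memory accumulator exactly once.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: every map task running in the JVM decrements a
// shared counter when it finishes; the task that brings the counter to
// zero is the last one, and only it flushes the shared accumulator.
public class MapTaskTracker {
    private final AtomicInteger remaining;

    public MapTaskTracker(int numMapTasks) {
        this.remaining = new AtomicInteger(numMapTasks);
    }

    // Returns true only for the map task that finishes last.
    public boolean finishTask() {
        return remaining.decrementAndGet() == 0;
    }

    public static void main(String[] args) {
        MapTaskTracker tracker = new MapTaskTracker(3);
        System.out.println(tracker.finishTask()); // false
        System.out.println(tracker.finishTask()); // false
        System.out.println(tracker.finishTask()); // true: last task, flush here
    }
}
```

Using `AtomicInteger` keeps the check lock-free and correct even when map tasks finish concurrently, which fits the goal of keeping the change transparent to application code.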


My own research into new related work


  • M3R


  1. Multiple threads running in a single JVM; multiple map/reduce tasks share the same JVM
    1. however, this is not the focus of the system; it focuses on running map/reduce in the same process, shuffling, cross-place communication, and caching (built on X10)
    2. not concerned about the memory footprint
  2. Contributions
    1. in memory cache
    2. reduced shuffling overhead
  3. Transparent to Hadoop API
    1. they rewrote the system but kept the Hadoop API
  4. Related work
    1. very similar to Spark, but claims compatibility with Hadoop; Spark is not a MapReduce engine


  • Scale-up vs Scale-out


  1. Same configuration of cores and memory, but scale-up saves the overhead of shuffling across the network
    1. performance/dollar
  2. Map intensive jobs see no difference in performance
  3. Scale-up can exhaust memory; memory is becoming an important resource
  4. They built a multithreaded extension for Hadoop, but saw the same performance


  • Hone: scaling down Hadoop on shared memory


  1. very similar to the previous paper, with little performance difference
  2. notes the memory overhead of using Java


  • Memory Footprint Matters: Efficient Equi-Join algorithms for main memory data processing


  1. “one must also consider the memory footprint of each join algorithm”
  2. multiple jobs could be running on the same node at the same time, putting more pressure on the memory
  3. hash join (map-side join) has a lower memory footprint than sort-based joins (reduce-side join)
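
The memory argument behind the last point can be illustrated with a minimal hash equi-join sketch (my own illustration, not code from the paper; `HashJoinSketch` and all names in it are made up): only the smaller relation is held in memory as a hash table, while the larger relation streams through, whereas a sort-based join must buffer and sort both inputs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a map-side (hash) equi-join. Memory footprint is
// proportional to the small relation only; the big relation streams.
public class HashJoinSketch {
    // Rows are (key, value) pairs; duplicate keys on the small side
    // are ignored here to keep the sketch short.
    public static List<String> hashJoin(List<SimpleEntry<Integer, String>> small,
                                        List<SimpleEntry<Integer, String>> big) {
        // Build phase: load the smaller relation into a hash table.
        Map<Integer, String> table = new HashMap<>();
        for (SimpleEntry<Integer, String> row : small) {
            table.put(row.getKey(), row.getValue());
        }
        // Probe phase: stream the larger relation and emit matches.
        List<String> out = new ArrayList<>();
        for (SimpleEntry<Integer, String> row : big) {
            String match = table.get(row.getKey());
            if (match != null) {
                out.add(match + ":" + row.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<SimpleEntry<Integer, String>> small = List.of(
            new SimpleEntry<>(1, "a"), new SimpleEntry<>(2, "b"));
        List<SimpleEntry<Integer, String>> big = List.of(
            new SimpleEntry<>(1, "x"), new SimpleEntry<>(3, "y"),
            new SimpleEntry<>(2, "z"));
        System.out.println(hashJoin(small, big)); // [a:x, b:z]
    }
}
```

This mirrors the paper's point: with multiple jobs per node competing for memory, a join whose working set is one relation instead of two is the safer choice.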


  • Memory-Efficient GroupBy-Aggregate using Compressed Buffer Trees


  1. a very useful section on EC2 cost for memory
    1. for most instance types, memory already dominates the total instance cost


This entry was posted in Hadoop Research.
