This is a summary of a discussion with Prof. Alan Cox on 11/26/2013 about the design of a possible Compute Server in the Hadoop MapReduce system.
The discussion focused on analyzing the relevant code segments in Child.java and TaskTracker.java. We outlined a plan for changing the Child.java implementation and noted a few work items to be completed by the next meeting.
Figure: the interactions between the TaskTracker, which runs in its own JVM, and the ChildJVM, which executes a Map/Reduce task in a separate JVM.
A short summary is listed below, with (TT) marking TaskTracker steps and (CJ) marking ChildJVM steps.
The control flow looks like this:
- (TT) Acquires a task from the JobTracker
- (TT) Maps the task to a JVMId
- (TT) Launches a JVM with a large number of command line arguments
- (CJ) Reads in the command line arguments and sets up an RPC connection with the TaskTracker (Umbilical Protocol)
- (CJ) Uses the RPC connection (Umbilical Protocol) to request the task that was assigned to its JVMId from the TaskTracker
- (TT) Finds the task mapped to the ChildJVM by its JVMId
- (CJ) Receives the assigned task, then starts and finishes executing it
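The handshake above can be sketched in a few lines of Java. This is only an illustrative model, not the code in Child.java or TaskTracker.java: the names `UmbilicalSketch`, `Umbilical`, `TrackerSide`, `assign`, and `childMain` are hypothetical, the RPC call is replaced by a plain method call, and a task is reduced to a string identifier.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the handshake; not Hadoop's actual classes.
public class UmbilicalSketch {

    // Stand-in for the RPC interface the child uses to call the TaskTracker.
    interface Umbilical {
        String getTask(int jvmId); // the task mapped to this JVMId, or null
    }

    // Stand-in for the TaskTracker side: keeps the JVMId -> task mapping.
    static class TrackerSide implements Umbilical {
        private final Map<Integer, String> taskByJvmId = new ConcurrentHashMap<>();

        // (TT) Map an acquired task to a JVMId before launching the child.
        void assign(int jvmId, String taskAttemptId) {
            taskByJvmId.put(jvmId, taskAttemptId);
        }

        // (TT) Find the task for the JVMId the child reports, and hand it out once.
        @Override
        public String getTask(int jvmId) {
            return taskByJvmId.remove(jvmId);
        }
    }

    // (CJ) The child knows only its JVMId (from the command line) and asks for its task.
    static String childMain(int jvmId, Umbilical umbilical) {
        String task = umbilical.getTask(jvmId);
        if (task == null) {
            return "no task for JVM " + jvmId;
        }
        return "executed " + task; // a real child would run the map or reduce here
    }

    public static void main(String[] args) {
        TrackerSide tracker = new TrackerSide();
        tracker.assign(7, "attempt_m_000001_0");
        System.out.println(childMain(7, tracker));
        System.out.println(childMain(8, tracker));
    }
}
```

The point of the sketch is that the child pulls its work: it can only ever ask for the task tied to the JVMId it was launched with.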
What are we trying to do?
The ChildJVM is designed to execute only the tasks assigned to a specific JVMId, and it executes only one active task at a time. With JVM reuse enabled, a ChildJVM can execute multiple Map/Reduce tasks, but only one after another. What we are trying to do is enable the ChildJVM to execute multiple tasks simultaneously, effectively making it a compute server.
How do we want to do it?
The approach we are currently considering is to have multiple tasks, each with a different assigned JVMId, execute in the same JVM, which effectively makes the JVMId useless.
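To make the difference between JVM reuse and the proposed compute server concrete, here is a minimal sketch that runs several tasks inside one JVM on a thread pool. Everything in it is an assumption for illustration: `SharedJvmSketch` and `peakActiveTasks` are hypothetical names, the tasks do no real work, and a pool of one thread stands in for today's one-task-at-a-time behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these names come from Hadoop.
public class SharedJvmSketch {

    // Runs one task per JVMId on `threads` worker threads inside this single JVM
    // and returns the highest number of tasks that were active at the same moment.
    static int peakActiveTasks(int threads, int[] jvmIds) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        // Tasks wait here until as many as can run together have started,
        // so the overlap is observable.
        CountDownLatch started = new CountDownLatch(Math.min(threads, jvmIds.length));
        List<Future<?>> results = new ArrayList<>();
        try {
            for (int jvmId : jvmIds) {
                // The JVMId is carried along only as a label; it no longer selects a JVM.
                results.add(pool.submit(() -> {
                    peak.accumulateAndGet(active.incrementAndGet(), Math::max);
                    started.countDown();
                    started.await(5, TimeUnit.SECONDS);
                    active.decrementAndGet();
                    return jvmId;
                }));
            }
            for (Future<?> f : results) {
                f.get(); // propagate any task failure
            }
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        int[] jvmIds = {1, 2, 3};
        System.out.println("one worker thread:    peak = " + peakActiveTasks(1, jvmIds));
        System.out.println("three worker threads: peak = " + peakActiveTasks(3, jvmIds));
    }
}
```

With a single worker thread the tasks run one by one, as with JVM reuse; with several worker threads, tasks carrying different JVMIds overlap in the same JVM.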
What are some specific implementation issues?
- We need to establish a connection from the TaskTracker to the ChildJVM. That way, when a task is available for execution, the TaskTracker can notify the ChildJVM and pass the task to it.
- We need to divide the code in the ChildJVM into two sections: a "manager thread", which is in charge of 1) receiving a task from the TaskTracker and 2) initializing it, and a "worker thread", which is in charge of executing the task.
- We need to strip away all the task information when launching a JVM, now that a JVM is no longer launched for a single task or JVMId. A simple way to do this would be to remove certain command line arguments (JVMId, port, host, TaskAttemptId, etc.) and use RPC over the TaskTracker-to-ChildJVM connection to pass in the relevant information.
- We need to decide how to set up the TaskTracker-to-ChildJVM connection. A simple option would be a hardcoded port number. There are usually just two ways of setting up this type of connection:
  - TCP ports
  - Unix domain sockets
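The manager/worker split and the TaskTracker-to-ChildJVM connection described above could look roughly like the following sketch, which uses a TCP port on the loopback interface. It is a sketch under stated assumptions, not a proposed patch: `ChildServerSketch`, `dispatch`, and `sendTask` are hypothetical names, the wire format (one task attempt id per line, one reply line) is invented for illustration in place of Hadoop RPC, and the port is chosen by the OS rather than hardcoded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of a child JVM acting as a compute server; not Hadoop code.
public class ChildServerSketch implements AutoCloseable {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    private final ServerSocket listener;

    ChildServerSketch() throws IOException {
        // Port 0 lets the OS pick a free port; a hardcoded port would go here instead.
        listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        Thread manager = new Thread(this::acceptLoop, "manager");
        manager.setDaemon(true);
        manager.start();
    }

    int port() {
        return listener.getLocalPort();
    }

    // Manager thread: accept connections from the TaskTracker until the listener closes.
    private void acceptLoop() {
        while (true) {
            Socket conn;
            try {
                conn = listener.accept();
            } catch (IOException e) {
                return; // the listener was closed (or failed); stop managing
            }
            try {
                dispatch(conn);
            } catch (IOException e) {
                try { conn.close(); } catch (IOException ignored) { }
            }
        }
    }

    // Manager thread: receive the task description, then hand it to a worker thread.
    private void dispatch(Socket conn) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
        String taskAttemptId = in.readLine();
        workers.submit(() -> {
            // Worker thread: "execute" the task and report back on the same connection.
            try (Socket c = conn;
                 PrintWriter out = new PrintWriter(
                         new OutputStreamWriter(c.getOutputStream(), StandardCharsets.UTF_8), true)) {
                out.println("done " + taskAttemptId);
            } catch (IOException e) {
                // The TaskTracker side went away; nothing to report to.
            }
        });
    }

    // Stand-in for the TaskTracker side: push one task to the child and wait for the reply.
    static String sendTask(int port, String taskAttemptId) throws IOException {
        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8))) {
            out.println(taskAttemptId);
            return in.readLine();
        }
    }

    @Override
    public void close() throws IOException {
        listener.close();
        workers.shutdown();
    }

    public static void main(String[] args) throws Exception {
        try (ChildServerSketch child = new ChildServerSketch()) {
            System.out.println(sendTask(child.port(), "attempt_m_000001_0"));
            System.out.println(sendTask(child.port(), "attempt_r_000002_0"));
        }
    }
}
```

Note that the direction of the connection is reversed relative to the current design: here the TaskTracker pushes a task to the child, so the child no longer needs a JVMId, host, or TaskAttemptId on its command line. The sketch covers only the TCP option; a Unix domain socket variant would keep the same manager/worker structure and change only how the listener is created.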