Initial Hadoop Compute Server Prototype Notes

This is a summary of what I did and learned while implementing a connection between the TaskTracker and the Child JVM. It focuses on the approach I took and the interesting things I learned along the way. We tested it with the grep application that comes with the Hadoop distribution. It also outlines some issues that I need to fix in the future, and my plan for the next implementations.

Update: I added an analysis below of the blocking call that executes the shell arguments. It explains why the runChild call blocks.

I modified two files, JvmManager.java and Child.java, both in the org.apache.hadoop.mapred package.

  1. Debug techniques
    • So far, the easiest way to debug is to put System.out.println(message) calls in the code.
      • For print statements in JvmManager.java, the output goes to the file hadoop-yz17-tasktracker.out under the log directory (it contains everything written to stdout by processes running in the TaskTracker).
      • For print statements in Child.java, the output goes to logdir/userlogs/attempt/stdout.
    • Experience: It is important to insert many print statements, so that we know which parts work and which do not. So far, I have not found an easy way to attach a debugger to these processes.
  2. JvmManager.java modifications
    • Goals: design a mechanism to (1) start the JVM from the command line, (2) set up a communication channel with the JVM, and (3) send task-related arguments through the channel.
    • Implementation
      • I use a separate thread that opens a socket to the Child JVM on a hard-coded port number and sends the arguments. The main thread continues to launch the JVM.
Thread t = new Thread() {
    public void run() {
        System.out.println("JvmManager: in the thread, about to launch the ChildJVM");
        try {
            // Give the Child JVM time to start its server socket before connecting.
            Thread.sleep(7000);

            InetAddress addr = InetAddress.getLocalHost();
            String hostName = addr.getHostName();
            final int computeServerPortNumber = 40103;

            Socket clientSocket = new Socket(hostName, computeServerPortNumber);
            PrintWriter out = new PrintWriter(clientSocket.getOutputStream(), true);
            BufferedReader in = new BufferedReader(
                new InputStreamReader(clientSocket.getInputStream()));

            System.out.println("JvmManager: writing from client to server");

            // Send the task attempt ID and the JVM ID as one space-separated line.
            out.println(taskAttemptIdStr + " " + localJvmIdStr);
        } catch (Exception e) {
            System.out.println("JvmManager: Error in launching task in the thread: " + e);
        }
    }
};

t.start();

After starting this thread, the main thread continues with the launch. I had to use a separate thread because launching the JVM appears to be a blocking call. If I launched the JVM first and only then sent the arguments, the Child JVM would be stuck at

serverSocket.accept()

waiting for a connection to be established. While the Child JVM is stuck, the main thread in runChild is stuck as well, resulting in a deadlock. With the separate thread, even though the main thread in runChild is blocked, the Child JVM can make progress once the separate thread makes a connection and sends the task-relevant arguments (taskAttemptID and JvmID).
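The interplay described above can be reproduced outside Hadoop. The following is a minimal, self-contained sketch, not the actual JvmManager or Child code: the class name HandshakeDemo, the method runHandshake, and the sample IDs are made up for illustration, and an OS-assigned port is used instead of 40103. The calling thread plays the role of the Child JVM and blocks in accept(); a helper thread plays the role of the thread started in JvmManager and unblocks it by connecting and sending one line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class HandshakeDemo {

    // Returns the line received by the side that plays the Child JVM.
    static String runHandshake(final String message) throws IOException, InterruptedException {
        // Port 0 asks the OS for any free port (assumption: the demo needs no fixed port).
        try (ServerSocket serverSocket = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            final int port = serverSocket.getLocalPort();

            // Helper thread: stands in for the thread started in JvmManager.
            Thread sender = new Thread() {
                @Override
                public void run() {
                    try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
                         PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
                        out.println(message);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            };
            sender.start();

            // Calling thread: stands in for the Child JVM and blocks in accept()
            // until the helper thread connects.
            try (Socket client = serverSocket.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                String line = in.readLine();
                sender.join();
                return line;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("received: " + runHandshake("attempt_200707121733_0003_m_000005_0 7"));
    }
}
```

Because the server socket is already bound before the helper thread starts, the demo does not need the Thread.sleep(7000) used in the prototype; in the real setup the two sides live in different processes, so some form of waiting or retrying is still needed there.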

  3. Child.java modifications

The changes focus on adding the server side of the connection to Child.java.


System.out.println("ChildJVM: ready to start up server socket");
// set up server socket on port 40103
final int portNumber = 40103;
ServerSocket serverSocket = new ServerSocket(portNumber);

System.out.println("ChildJVM: after serverSocket object init, before accept");

Socket clientSocket = serverSocket.accept();
PrintWriter out = new PrintWriter(clientSocket.getOutputStream(), true);
BufferedReader in = new BufferedReader(new InputStreamReader(clientSocket.getInputStream()));

String inputLine;
String[] jvmArgs = new String[2];

System.out.println("ChildJVM: before while loop waiting for content");
// Only the first line is needed: "<taskAttemptId> <jvmId>"
while ((inputLine = in.readLine()) != null) {
    System.out.println("ChildJVM: inputLine is " + inputLine);
    jvmArgs = inputLine.split(" ");
    break;
}

System.out.println("ChildJVM: after while loop, got the content");

final TaskAttemptID firstTaskid = TaskAttemptID.forName(jvmArgs[0]);
int jvmIdInt = Integer.parseInt(jvmArgs[1]);
JVMId jvmId = new JVMId(firstTaskid.getJobID(), firstTaskid.isMap(), jvmIdInt);

System.out.println("ChildJVM: jvmId is: " + jvmArgs[1]);
System.out.println("ChildJVM: attempt id is: " + jvmArgs[0]);
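The child-side code above hard-codes port 40103 and trusts that the incoming line has exactly two fields. Below is a hedged sketch of how both could be made a little more robust using only the JDK. Everything named here is an assumption for illustration: the class ChildHandshake, the holder class Handshake, and the system property name compute.server.port do not exist in Hadoop. The Hadoop-specific conversion (TaskAttemptID.forName and the JVMId constructor) is left out so that the sketch runs on its own.

```java
public class ChildHandshake {

    // Hypothetical property name; the prototype itself hard-codes 40103.
    static final String PORT_PROPERTY = "compute.server.port";
    static final int DEFAULT_PORT = 40103;

    // Reads the port from -Dcompute.server.port=<n>. Integer.getInteger falls
    // back to the default when the property is missing or not a valid integer.
    static int resolvePort() {
        return Integer.getInteger(PORT_PROPERTY, DEFAULT_PORT);
    }

    // Holds the two fields sent by the parent as "<taskAttemptId> <jvmId>".
    static final class Handshake {
        final String taskAttemptId;
        final int jvmId;

        Handshake(String taskAttemptId, int jvmId) {
            this.taskAttemptId = taskAttemptId;
            this.jvmId = jvmId;
        }
    }

    // Parses one handshake line and fails with a clear message instead of a
    // NullPointerException or ArrayIndexOutOfBoundsException later on.
    static Handshake parse(String line) {
        if (line == null) {
            throw new IllegalArgumentException("connection closed before any data arrived");
        }
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException(
                "expected \"<taskAttemptId> <jvmId>\" but got: " + line);
        }
        try {
            return new Handshake(parts[0], Integer.parseInt(parts[1]));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("jvmId is not an integer: " + parts[1], e);
        }
    }

    public static void main(String[] args) {
        Handshake h = parse("attempt_200707121733_0003_m_000005_0 7");
        System.out.println("port=" + resolvePort()
            + " attempt=" + h.taskAttemptId + " jvmId=" + h.jvmId);
    }
}
```

With something like this, the parent would have to add a matching -D option to the child's command line and connect to the same port, using different values for map and reduce JVMs. Where that command line is assembled is not covered in these notes.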

  4. Next steps
    • Hard-coded port: if a map JVM and a reduce JVM run simultaneously, both would try to use the same port number, which would be a problem. There must be a way to pass the port number to the JVM as an argument, and it should be different for map and reduce tasks. This may not be a problem if map and reduce JVMs are never present at the same time. We need to do another test run with grep, this time with the reduce phase enabled.
    • The bigger problem is enabling multiple tasks to run in the same JVM. Currently, we set the number of map slots on each TaskTracker to 1, which means no two tasks will be assigned simultaneously to the same JVM (this needs verification). There are two ways to deal with this:
      • Create fake JVMIds that all map to the same JVM and manage that.
        • Need to find out where the JVMIds are generated
      • Assign multiple tasks simultaneously to the same JVM
        • Could be problematic, as we may have to change the task assignment algorithm and the progress reporting.
    • The blocking issue with runChild
      • The runChild() method is called by the run() method of JvmRunner, which is invoked when JvmRunner.start() is called, so it executes in a separate thread. The call to JvmRunner.start() is made in spawnNewJvm() in JvmManager. The code segment, with its original comments, is here:

private void spawnNewJvm(JobID jobId, JvmEnv env,
    TaskRunner t) {
  JvmRunner jvmRunner = new JvmRunner(env, jobId, t.getTask());
  jvmIdToRunner.put(jvmRunner.jvmId, jvmRunner);
  //spawn the JVM in a new thread. Note that there will be very little
  //extra overhead of launching the new thread for a new JVM since
  //most of the cost is involved in launching the process. Moreover,
  //since we are going to be using the JVM for running many tasks,
  //the thread launch cost becomes trivial when amortized over all
  //tasks. Doing it this way also keeps code simple.
  jvmRunner.setDaemon(true);
  jvmRunner.setName("JVM Runner " + jvmRunner.jvmId + " spawned.");
  setRunningTaskForJvm(jvmRunner.jvmId, t);
  LOG.info(jvmRunner.getName());
  jvmRunner.start();
}
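To make the threading pattern in spawnNewJvm() concrete, here is a small stand-alone sketch. It is an illustration under stated assumptions, not Hadoop code: FakeJvmRunner and RunnerThreadDemo are made-up names, and a CountDownLatch stands in for the shell exec that blocks until the child process exits. It shows that start() returns to the caller right away while the blocking runChild() call stays confined to the runner thread.

```java
import java.util.concurrent.CountDownLatch;

public class RunnerThreadDemo {

    // Stand-in for JvmRunner: run() makes a call that blocks until the "child" exits.
    static class FakeJvmRunner extends Thread {
        private final CountDownLatch childExited;
        volatile boolean finished = false;

        FakeJvmRunner(CountDownLatch childExited) {
            this.childExited = childExited;
        }

        @Override
        public void run() {
            runChild();
            finished = true;
        }

        // Stand-in for the blocking shell exec: waits until the latch is released.
        private void runChild() {
            try {
                childExited.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Returns true if start() came back while runChild() was still blocked
    // and the runner finished once the simulated child exited.
    static boolean startReturnsBeforeChildExits() throws InterruptedException {
        CountDownLatch childExited = new CountDownLatch(1);
        FakeJvmRunner runner = new FakeJvmRunner(childExited);
        runner.setDaemon(true);
        runner.start();                            // returns immediately
        boolean stillBlocked = !runner.finished;   // runChild() cannot finish before countDown()
        childExited.countDown();                   // simulate the child process exiting
        runner.join();
        return stillBlocked && runner.finished;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("start() returned before child exit: " + startReturnsBeforeChildExits());
    }
}
```

This mirrors why only the runner thread is held up by the blocking call, while any handshake with the child still has to come from some other thread, as in the JvmManager change described earlier.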
