How does input split get passed from Task tracker to Child JVM task execution

In the last post, we traced the process a task is assigned to a Tasktracker -> launch new JVM (, TaskTracker.TaskController)-> execute the assigned task in the ChildJVM. (

Now, we delve into how the map and reduce task get its input split. This time we trace it backwards from LineRecordReader.initialize() method back to the TaskSplitIndex init in the runNewMapper method

void runNewMapper(final JobConf job,

final TaskSplitIndex splitIndex,

final TaskUmbilicalProtocol umbilical,

TaskReporter reporter


This leads to the run method in mapTask, it references an environment variable splitMetaInfo, which is set in the constructor of the Map Task

public MapTask(String jobFile, TaskAttemptID taskId,

int partition, TaskSplitIndex splitIndex,

int numSlotsRequired) {

super(jobFile, taskId, partition, numSlotsRequired);

this.splitMetaInfo = splitIndex;


Following this lead, we use the information from last post and go back one more level to

// Create a final reference to the task for the doAs block

final Task taskFinal = task;

childUGI.doAs(new PrivilegedExceptionAction<Object>() {


public Object run() throws Exception {

try {

// use job-specified working directory

FileSystem.get(job).setWorkingDirectory(job.getWorkingDirectory());, umbilical);

We keep digging in to see how we got the task information in the first place

JvmTask myTask = umbilical.getTask(context);

if (myTask.shouldDie()) {


} else {

if (myTask.getTask() == null) {

taskid = null;

currentJobSegmented = true;


if (++idleLoopCount >= SLEEP_LONGER_COUNT) {

//we sleep for a bigger interval when we don't receive

//tasks for a while


} else {






idleLoopCount = 0;

task = myTask.getTask();

This indicates that the task information is got through remote method invocation system built on top of umbilical interface. The Task was created out of a JvmTask object. We go to one implementation in TaskTracker.

The way you get the task is to go to JVM Map Manager and get it

synchronized public TaskInProgress getTaskForJvm(JVMId jvmId)

throws IOException {

if (jvmToRunningTask.containsKey(jvmId)) {

//Incase of JVM reuse, tasks are returned to previously launched

//JVM via this method. However when a new task is launched

//the task being returned has to be initialized.

TaskRunner taskRunner = jvmToRunningTask.get(jvmId);

JvmRunner jvmRunner = jvmIdToRunner.get(jvmId);

Task task = taskRunner.getTaskInProgress().getTask();



return taskRunner.getTaskInProgress();



return null;


This entry was posted in Hadoop Research. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s