How does Hadoop create job splits and assign them to tasks?

Coming back to the problem we are trying to solve here, trying to pass multiple InputSplits to one Map Task. It appears that the Task object stayed relatively intact after it is retrieved by the TaskTracker and assigned to the TaskRunner, the ChildJVM. As a result, it is possible that we don’t need to modify the mechanisms that the Task object is transferred around. Instead, if we could just locate the point that the Task object is created, we could added a list of input splits in there to increase the amount of input data a single map task is carrying.

It appears that we the split mapping to a task is stored in a maps[], the critical code block in JobInProgress.java


 

/**

* Construct the splits, etc.  This is invoked from an async

* thread so that split-computation doesn't block anyone.

*/

public synchronized void initTasks()

throws IOException, KillInterruptedException, UnknownHostException {

…...

TaskSplitMetaInfo[] splits = createSplits(jobId);

if (numMapTasks != splits.length) {

throw new IOException("Number of maps in JobConf doesn't match number of " +

"recieved splits for job " + jobId + "! " +

"numMapTasks=" + numMapTasks + ", #splits=" + splits.length);

}

numMapTasks = splits.length;

 

// Sanity check the locations so we don't create/initialize unnecessary tasks

for (TaskSplitMetaInfo split : splits) {

NetUtils.verifyHostnames(split.getLocations());

}

jobtracker.getInstrumentation().addWaitingMaps(getJobID(), numMapTasks);

jobtracker.getInstrumentation().addWaitingReduces(getJobID(), numReduceTasks);

this.queueMetrics.addWaitingMaps(getJobID(), numMapTasks);

this.queueMetrics.addWaitingReduces(getJobID(), numReduceTasks);

 

maps = new TaskInProgress[numMapTasks];

for(int i=0; i < numMapTasks; ++i) {

inputLength += splits[i].getInputDataLength();

maps[i] = new TaskInProgress(jobId, jobFile,

splits[i],

jobtracker, conf, this, i, numSlotsPerMap);

}

This is potentially where we are going to start modifying the Hadoop source code to map one task to multiple splits.

This seems to be the easy way to go.

Advertisements
This entry was posted in Hadoop Research, HJ-Hadoop Improvements, Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s