Using Static Variables to Improve Memory Efficiency in Hadoop MapReduce Runtime

This post documents my attempt to share in-memory data structures across multiple map tasks running in the same JVM. I first introduce what a “static” variable is in Java. Then I give examples of how I modified MapReduce programs (the mapper implementation) to reduce latency and improve memory efficiency. Some of the techniques can be used in the original sequential Hadoop MapReduce system as well.

Static Variables

Static variables are associated with a class rather than with a specific instance of the class. They are often used for constants shared by all instances of the class. In these cases, they are often paired with “final”, which means the variable cannot be reassigned after its initial assignment.

For example, suppose we are doing mathematical calculations and every Circle object needs the constant value Pi. We would write something like this:

public class Circle {

    // One copy of Pi, shared by every Circle
    static final double Pi = 3.1415;

    // Radius of this particular circle
    double r;

    double getArea() { return Pi * r * r; }
}


A static variable lives in a single fixed memory location, and there is only one copy of it per class (unlike instance variables, which get one copy per object). Static variables are referenced by class name: in this case, Circle.Pi instead of circle1.Pi (where circle1 is an instance of the Circle class).
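To make the "one copy per class" point concrete, here is a minimal sketch. The class CircleDemo and its counter field are made up for illustration and are not part of the code discussed in this post.

```java
// Hypothetical demo class, used only to illustrate static vs. instance fields.
class CircleDemo {
    static final double Pi = 3.1415; // one copy, owned by the class
    static int created = 0;          // also one copy, shared by all instances
    final double r;                  // one copy per instance

    CircleDemo(double r) {
        this.r = r;
        created++;                   // every constructor call updates the same counter
    }

    double getArea() { return Pi * r * r; }

    public static void main(String[] args) {
        CircleDemo a = new CircleDemo(1.0);
        CircleDemo b = new CircleDemo(2.0);
        // Static members are reached through the class name, not an instance.
        System.out.println(CircleDemo.created);
        System.out.println(CircleDemo.Pi);
        // Instance state differs per object, so the two areas differ.
        System.out.println(a.getArea());
        System.out.println(b.getArea());
    }
}
```

Both objects see the same Pi and the same counter, while each keeps its own radius.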

Static variables are also often used to share data across threads. In many cases, variables shared across threads should also be declared “static final”. The variable can then be assigned only once, so there are no concurrent writes to it and data races on the variable itself are prevented. Note that final only fixes the reference: if the variable points to a mutable object, that object's contents still have to be treated as read-only or protected in some other way.

The static modifier can be applied not only to variables but also to methods and nested classes. A common use of static methods is to access static variables.
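As a rough illustration of sharing a static final variable across threads, the sketch below builds a small table once and lets several threads read it. SharedLookup, its table contents, and sumFromThreads are all hypothetical names chosen for this example. The table is wrapped in an unmodifiable view, because final alone would not stop a thread from changing the map's contents.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class; the table contents are made up.
class SharedLookup {
    // Built once during class initialization and never reassigned.
    static final Map<String, Integer> TABLE = buildTable();

    private static Map<String, Integer> buildTable() {
        Map<String, Integer> m = new HashMap<>();
        m.put("a", 1);
        m.put("b", 2);
        // Writes through this view throw UnsupportedOperationException.
        return Collections.unmodifiableMap(m);
    }

    // Starts the given number of reader threads and returns the sum of the
    // values they looked up in the shared table.
    static int sumFromThreads(int threads) {
        AtomicInteger sum = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> sum.addAndGet(TABLE.get("a") + TABLE.get("b")));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return sum.get();
    }
}
```

Every thread reads the same map object; only the running total, which is actually written concurrently, needs an atomic type.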

More references can be found here

A good Stack Overflow post explains why static variables are shared across threads. Static variables are stored in the heap. Each thread has its own stack but shares the heap space.

Using Java Static Variables in Map Tasks

Since each map task is executed through a method call, we can declare the shared, read-only variables used in map tasks as static. With the compute server implementation I produced, multiple mappers run in the same JVM simultaneously. As a result, static variables are shared across map tasks automatically. This is the cleanest approach I came up with.
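The idea can be sketched for a hash join as follows. This is a simplified stand-in and not the code from my implementation: JoinMapperSketch does not extend Hadoop's Mapper class, the map signature is simplified, and the table rows are invented. It only shows the build side of the join living in a static field while each mapper keeps its own output.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a hash-join mapper; it does not use the Hadoop API.
class JoinMapperSketch {
    // Build side of the join: one copy per JVM, read-only after class initialization.
    static final Map<Integer, String> BUILD_TABLE = loadBuildTable();

    private static Map<Integer, String> loadBuildTable() {
        Map<Integer, String> m = new HashMap<>();
        m.put(1, "alice"); // made-up rows
        m.put(2, "bob");
        return Collections.unmodifiableMap(m);
    }

    // Per-task state: each mapper instance has its own output list.
    final List<String> output = new ArrayList<>();

    // Probe side: called once per input record.
    void map(int key, String value) {
        String match = BUILD_TABLE.get(key);
        if (match != null) {
            output.add(match + "," + value);
        }
    }
}
```

However many JoinMapperSketch objects run in the JVM, the build table is held in memory once, which is where the memory saving comes from.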

Results show that this approach significantly reduces the memory footprint of certain common MapReduce applications, such as HashJoin. My initial results show that the compute server implementation can process 4 times more data than the original sequential Hadoop when executing HashJoin in MapReduce.

In addition, using static variables while turning on JVM reuse can be a powerful combination. In the setup phase, the application sometimes needs to load certain data, such as a lookup table or topic clusters. Once we make the loaded data structure static, the data is loaded only once, and all later map tasks of the same job that run in that JVM skip the load. This can save a lot of time for MapReduce applications that spend much of their time loading data, such as HashJoin. The one caveat is that you might need locks to make sure only one thread initializes the data structure. (This is not a problem in the original Hadoop implementation, but it is a problem in my compute-server-based implementation.)
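One simple way to get the "load once, under a lock" behavior is a synchronized static accessor, sketched below. LookupCache, its load method, and the loadCount counter are hypothetical; the load body is a placeholder for whatever expensive read the real application performs, and runTasks merely imitates several map tasks reaching their setup phase at the same time.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical one-time loader; not taken from Hadoop or from my implementation.
class LookupCache {
    private static Map<String, String> table; // null until the first task loads it
    private static int loadCount = 0;         // kept only to show how often load() ran

    // synchronized on the class object: only one thread at a time can be in here,
    // so exactly one caller sees table == null and performs the load.
    static synchronized Map<String, String> get() {
        if (table == null) {
            table = Collections.unmodifiableMap(load());
            loadCount++;
        }
        return table;
    }

    static synchronized int loadCount() {
        return loadCount;
    }

    // Placeholder for an expensive load, such as reading a lookup file.
    private static Map<String, String> load() {
        Map<String, String> m = new HashMap<>();
        m.put("k1", "v1");
        return m;
    }

    // Imitates several map tasks calling their setup code concurrently.
    static void runTasks(int tasks) {
        Thread[] workers = new Thread[tasks];
        for (int i = 0; i < tasks; i++) {
            workers[i] = new Thread(() -> get());
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

Each task would call LookupCache.get() from its setup code. The synchronized accessor takes the lock on every call, which is usually negligible next to the cost of the load itself; if the data can be loaded at class initialization time, a static final field with an initializer avoids the explicit lock altogether.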

