Documentation on the scripts I used for the STIC Cluster at Rice

This is a record, for my own reference, of a series of scripts I wrote, since I realized I quickly forget how they work.

  1. Compile and update Hadoop source file
  2. Analyze the log files for garbage collection information
  3. Record the CPU and memory utilization of the MapReduce applications
  4. Scripts that save the logging record of the MapReduce jobs

Compile and update Hadoop source file

The script is located in the

Profile the CPU and Memory Utilization of all the slave nodes in the cluster

The line in the benchmark script that calls the profiling script is:

${HADOOP_HOME}/profile.sh $APP-$type-$try-$blocksize-$mappers 1 &

PROF_PID=$!

PROF_PID saves the process ID so that we can kill the profiler after it has finished profiling. While it runs, the script uses ssh to send a “top” command to all the nodes and collects the output.
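profile.sh itself is a shell script and I am not reproducing it here; the Python sketch below only illustrates the general idea. The node list, sampling interval, output paths, and the top column positions are assumptions for the example, not taken from the real script.

import subprocess, time

# Hypothetical settings; the real profile.sh has its own node list and paths.
NODES = ["cn-0099", "cn-0100"]   # compute nodes to poll
OUT_DIR = "profile"              # where the per-node logs go
INTERVAL = 1                     # seconds between samples

start = time.time()
iteration = 0
while True:                      # runs until killed (the benchmark kills it via PROF_PID)
    iteration += 1
    elapsed = time.time() - start
    for node in NODES:
        # One batch snapshot of top on the remote node, keeping only the Java processes.
        top_out = subprocess.run(
            ["ssh", node, "top -b -n 1 | grep java"],
            capture_output=True, text=True).stdout
        with open("%s/%s.log" % (OUT_DIR, node), "a") as f:
            f.write("%d\t%.9f sec:\n" % (iteration, elapsed))
            for line in top_out.splitlines():
                cols = line.split()
                # In the default top layout, %CPU and %MEM are columns 9 and 10.
                f.write("%s\t%s\n" % (cols[8], cols[9]))
    time.sleep(INTERVAL)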

The output logs are stored in the ${HADOOP_HOME}/profile directory. Find the directory matching the app-mapper name; inside the job directory there is one file per compute node. The output is recorded as below:

ITER    TIME    CPU %   MEM%
1       .002382000 sec:
15.3    0.2
18.4    0.2
0.0     0.0
0.0     0.0

The first column is the iteration number (one iteration per data-collection sample), the time is the time elapsed since the start of profiling, and the last two columns are the CPU and memory utilization of each Java process.

Analyze the profile data for all nodes (average across all nodes)

The basic analysis script is parse.py in ${HADOOP_HOME}. It analyzes the profile output files described above and reports the average memory footprint of the Java process with the largest memory footprint (a single mapper JVM process).
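I am not reproducing parse.py here, but it does roughly what the sketch below does (the exact aggregation in the real script may differ): for each sample it sums the CPU% of the Java processes and keeps the largest single-process memory footprint, then reports averages and maxima over the run.

import sys

def parse_profile(path):
    """Parse one per-node profile file in the format shown above."""
    samples = []            # (total CPU%, largest per-process MEM%) for each sample
    cur = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("ITER"):
                continue                   # skip blank lines and the header row
            if line.endswith("sec:"):      # iteration header, e.g. "1  .002382000 sec:"
                if cur is not None:
                    samples.append(cur)
                cur = [0.0, 0.0]
            else:                          # data row: "CPU%  MEM%"
                cpu, mem = map(float, line.split())
                cur[0] += cpu              # total CPU across the Java processes
                cur[1] = max(cur[1], mem)  # footprint of the biggest JVM (the mapper)
    if cur is not None:
        samples.append(cur)
    cpus = [s[0] for s in samples]
    mems = [s[1] for s in samples]
    print("Avg CPU:", sum(cpus) / len(cpus))
    print("Max CPU:", max(cpus))
    print("Avg Mem:", sum(mems) / len(mems))
    print("Max Mem:", max(mems))

if __name__ == "__main__":
    parse_profile(sys.argv[1])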

The output of the parse script is saved in the ${HADOOP_HOME}/outputs directory, in the following format:

Max Heap Size 8192 MB Block Size 32 MB, Mappers 8
seq, Try 1
real    41m32.796s

cn-0099
Avg CPU: 724.276138637
Max CPU: 789.5
Avg Mem: 29.4955556034
Max Mem: 38.1

cn-0100
Avg CPU: 2.05916953666
Max CPU: 200.9
Avg Mem: 1.92263188933
Max Mem: 2.0

There is another script, parse-all-nodes.py, in ${HADOOP_HOME}/scripts/pscripts. It analyzes the output from all nodes and takes an average across them.
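Again, just a sketch of the idea rather than the real parse-all-nodes.py: read the parse output shown above and average each of the Avg/Max CPU/Mem lines over all compute nodes.

import sys
from collections import defaultdict

def average_across_nodes(path):
    """Average the per-node Avg/Max CPU/Mem values in a parse output file."""
    values = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.strip()
            for key in ("Avg CPU", "Max CPU", "Avg Mem", "Max Mem"):
                if line.startswith(key + ":"):
                    values[key].append(float(line.split(":", 1)[1]))
    for key in ("Avg CPU", "Max CPU", "Avg Mem", "Max Mem"):
        if values[key]:
            print("%s (all nodes): %f" % (key, sum(values[key]) / len(values[key])))

if __name__ == "__main__":
    average_across_nodes(sys.argv[1])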

Manage the log directory for each compute node

In the benchmark script, we move the logs into ${HADOOP_HOME}/logs; the subdirectory is named with the current date and time so that we can easily find the job later (this has proven very useful in practice).

The first step is to build the log folder named with the start date and time:

file="out-"`date +%T`

currentDate=`date +%m-%d-%y`

outputFolder=$HADOOP_HOME/outputs/out-$currentDate

logFolder=$HADOOP_HOME/logs/log-$currentDate

Then create a subfolder under the job directory for each compute node and move each node's log folder into place ($newdir below presumably refers to the dated job folder created above):

for i in $COMP_NODES; do mkdir $newdir/$i; done

#for i in $COMP_NODES; do ssh $i "cp $LOG_PATH/*.log $newdir/$i"; done

for i in $COMP_NODES; do ssh $i "mv $LOG_PATH/* $newdir/$i"; done

Analyze the log directory for garbage collection information

A script is provided to analyze the log directory: it goes into the stdout file for each task attempt and greps for "full GC" calls.

The script is ${HADOOP_HOME}/scripts/pscripts/parseGC.py; it takes as input the job directory inside the log directory.
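The script itself is not reproduced here; a minimal sketch of the same idea (assuming one stdout file per task attempt somewhere under the job directory) would walk the tree and count the full-GC lines in each attempt's stdout:

import os, sys

def count_full_gc(job_dir):
    """Walk a job's log directory and count full GC lines in each attempt's stdout."""
    for root, _, files in os.walk(job_dir):
        for name in files:
            if name != "stdout":
                continue
            path = os.path.join(root, name)
            with open(path, errors="ignore") as f:
                # Match case-insensitively, since the JVM GC log prints "Full GC".
                full_gcs = sum(1 for line in f if "full gc" in line.lower())
            print(path, full_gcs)

if __name__ == "__main__":
    count_full_gc(sys.argv[1])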

 
