This is a summary of my meeting with Prof. Alan Cox on Feb 7th. We discussed further results for the KMeans application, improved heap memory utilization measurement, and future plans. This document summarizes the important parts of the meeting:
- Running time and heap memory utilization results from benchmark runs of 6 different configurations of the KMeans application
- Measuring the size ratio between the in-memory data structures for cluster centroids and the text file of cluster centroids
- Next steps for the research project
I showed the running times for cluster data text files of 30 MB, 63 MB and 100 MB. The running times are shown in the graph and table below.
Table: running time (minutes) for the seq JVM with 8 seq mappers (1024 MB each), 1 par mapper (8192 MB) and 2 par mappers (4048 MB each), and for the par JVM with 8 seq mappers, 1 par mapper and 2 par mappers.
Running time Result Analysis Notes:
- seq-jvm-8-seq-mapper and par-jvm-8-seq-mapper have a significant running time difference, and it is not yet clear why. I originally thought it was caused by increased garbage collection, but I have no evidence supporting that claim; in fact, par-jvm-8-seq-mapper uses less memory. The slowdown could be related to the large number of mappers running simultaneously. TODO: Prof. Cox suggested that I try seq-jvm-2-mapper and par-jvm-2-mapper and check whether a significant slowdown is still observed. The results of this experiment might indicate that there is some overhead associated with starting up a task in the par JVM (possibly related to the waiting seconds).
- The results show that par-jvm-1-par-mapper and par-jvm-2-par-mapper perform at the same level as seq-jvm-1-par-mapper and seq-jvm-2-par-mapper, respectively. This indicates that the hybrid approach has no negative impact on the running time of the system.
- The results show that 2 parallel mappers perform better than 1 parallel mapper and slightly better than 8 sequential mappers. We attribute the improvement to better IO utilization and better overlap of IO and computation: when one mapper is blocked on IO, the other mapper can use all 8 cores. TODO: Show the improved CPU utilization for 2 parallel mappers by plotting CPU utilization over time.
- Prof. Cox also suggested an experiment to investigate why 3 parallel mappers are faster than 2 mappers: can we save cycles by avoiding frequent context switches? TODO: Adjust the number of threads per mapper as the number of mappers increases. The goal is for each mapper to have only enough threads to keep the CPU fully utilized while one mapper is blocked on IO. For example, with two mappers, each mapper gets 8 threads (full utilization of 8 cores when one mapper is blocked); with three mappers, each gets 8 / (3 - 1) = 4 threads.
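The thread-sizing rule in the last TODO can be written down as a small helper: with M mappers and one of them blocked on IO, the other M - 1 mappers should together supply one thread per core, so each mapper gets cores / (M - 1) threads. This is a minimal sketch, assuming the 8-core machine used in these runs and rounding up when the division is not exact; the class and method names are hypothetical and not part of the actual job code.

```java
/**
 * Sketch of the thread-sizing rule: give each of M mappers just enough
 * threads that the remaining M - 1 mappers keep every core busy while one
 * mapper is blocked on IO. Names are hypothetical.
 */
public final class ThreadsPerMapper {

    /** Returns ceil(cores / (mappers - 1)); the rule needs at least 2 mappers. */
    static int threadsPerMapper(int cores, int mappers) {
        if (cores < 1 || mappers < 2) {
            throw new IllegalArgumentException("need cores >= 1 and mappers >= 2");
        }
        int others = mappers - 1;            // mappers still running while one waits on IO
        return (cores + others - 1) / others; // integer ceiling division
    }

    public static void main(String[] args) {
        int cores = 8; // assumed core count, matching the 8-core runs above
        for (int mappers = 2; mappers <= 5; mappers++) {
            System.out.println(mappers + " mappers -> "
                    + threadsPerMapper(cores, mappers) + " threads each");
        }
    }
}
```

Rounding up is a judgment call: for 4 mappers it gives 3 threads each (9 runnable threads on 8 cores while one mapper waits), whereas rounding down would give 2 threads each and leave 2 cores idle. Since the experiment is about reducing context switches, both variants may be worth measuring.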
Next, we looked at the heap memory utilization results for the same 6 configurations. Previous experiments using the top command were unsuccessful because it was hard to spot changes in resident memory when we increased or decreased the number of copies of the cluster data in memory. We discussed the following topics:
- Why is there no significant fluctuation in heap memory utilization?
- My results show that for the 63 MB cluster data size, the fluctuation in heap memory utilization is relatively low. Prof. Alan Cox explained that this is mostly because many lightweight garbage collections happen throughout the execution, and lightweight garbage collections do not call “munmap” to return memory to the operating system. Occasionally, the JVM performs a heavyweight “full garbage collection” that does return memory to the operating system. This is consistent with my GC results: roughly 30,000 regular garbage collections in a typical run, with only 5-6 full garbage collections.
- How to further improve memory measurement techniques?
- One of the issues with measuring heap utilization with “top” is that it is not very accurate: it does not show a significant difference when we increase the number of mappers.
- We used MXBeans (java.lang.management) to measure heap memory utilization (please see previous post).
- We used VisualVM to see the memory used by individual objects (please see previous post).
- How much memory are we saving exactly?
- This is shown in the following table:
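| Cluster data (MB) | 8 seq mappers, par JVM | 8 seq mappers, seq JVM | 1 par mapper, par JVM | 1 par mapper, seq JVM | 2 par mappers, par JVM | 2 par mappers, seq JVM |
|---|---|---|---|---|---|---|
| 30 | 3274 | 3840 | 2243 | 2183 | | |
| 63 | 4246 | 5816 | 2386 | 2243 | 2515 | 2700 |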
- Notes: from the above table, we can derive the following:
- The in-memory data structure is roughly 2x the size of the text data. For example, 63 MB of text data for cluster centroids would take around 140 MB in memory. TODO: This needs to be verified further with a heap dump analysis using VisualVM.
- How to develop better scripts to automatically collect the heap memory utilization data?
- For the parallel JVM this is easy, because there is only a single stderr file, where we currently print all the heap data. It gets tricky for the sequential JVM because there are several stderr files spread across different locations. TODO: At a minimum, for the parallel JVM, automatically find the single stderr file in the logs directory, run the analysis script on it, and write the result to an output file in the output directory. This would save us a lot of time moving forward.
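The measurement discussion above relies on the gap between heap that is in use and heap the JVM keeps from the OS. The previous post has the code actually used; the following is only a minimal sketch of the same idea with the standard `java.lang.management` API. It prints used versus committed heap and the per-collector GC counts. Used heap drops after every minor collection, while committed heap (closer to what the OS, and therefore top, sees) typically changes only when the JVM resizes the heap. The class and method names and the output format are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/**
 * Minimal sketch: sample heap usage and GC counts through the standard
 * java.lang.management MXBeans. Names and output format are hypothetical.
 */
public final class HeapSampler {

    private static final long MB = 1024L * 1024L;

    /** Used heap (live plus not-yet-collected objects) vs. heap committed by the JVM. */
    static String heapLine() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return "heap used=" + heap.getUsed() / MB + " MB committed="
                + heap.getCommitted() / MB + " MB";
    }

    /** Sum of collection counts over all collector beans (typically one young, one old). */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if this collector does not report a count
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Printed to stderr, where the heap data for these runs is collected.
        System.err.println(heapLine());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.err.println("gc " + gc.getName() + " count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        System.err.println("gc total count=" + totalCollections());
    }
}
```

Collector names depend on the JVM and the garbage collector in use. Separating the young-generation and old-generation counts is one way to reproduce the "many regular collections, few full collections" observation directly from the task's own output.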
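The parallel-JVM TODO (find the single stderr file, analyze it, write the result to the output directory) could look roughly like the sketch below. It rests on several assumptions. The task log file is assumed to be named `stderr`, as in Hadoop's per-task log directories. Each heap sample is assumed to appear on a line containing `heap used=<N> MB`, which is a hypothetical format. The "analysis" is only a sample count and a peak value, so the real analysis script would replace `summarize`. The names `StderrHeapReport`, `heap-summary.txt`, and the default `logs`/`output` paths are also hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Sketch: locate the single stderr log, summarize heap samples, write a report. */
public final class StderrHeapReport {

    // Hypothetical sample format; adjust to whatever the job actually prints.
    private static final Pattern SAMPLE = Pattern.compile("heap used=(\\d+) MB");

    /** Finds the one regular file named "stderr" anywhere under logsDir. */
    static Path findSingleStderr(Path logsDir) throws IOException {
        try (Stream<Path> paths = Files.walk(logsDir)) {
            List<Path> matches = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().equals("stderr"))
                    .collect(Collectors.toList());
            if (matches.size() != 1) {
                throw new IllegalStateException(
                        "expected exactly one stderr file, found " + matches.size());
            }
            return matches.get(0);
        }
    }

    /** Counts matching heap samples and records the peak used heap in MB. */
    static String summarize(Path stderr) throws IOException {
        long samples = 0;
        long peakMb = 0;
        for (String line : Files.readAllLines(stderr, StandardCharsets.UTF_8)) {
            Matcher m = SAMPLE.matcher(line);
            if (m.find()) {
                samples++;
                peakMb = Math.max(peakMb, Long.parseLong(m.group(1)));
            }
        }
        return "samples=" + samples + " peak_heap_used_mb=" + peakMb;
    }

    public static void main(String[] args) throws IOException {
        Path logsDir = Paths.get(args.length > 0 ? args[0] : "logs");
        Path outputDir = Paths.get(args.length > 1 ? args[1] : "output");
        String summary = summarize(findSingleStderr(logsDir));
        Files.createDirectories(outputDir);
        Files.write(outputDir.resolve("heap-summary.txt"),
                (summary + System.lineSeparator()).getBytes(StandardCharsets.UTF_8));
        System.out.println(summary);
    }
}
```

Failing loudly when zero or several stderr files are found is deliberate. The same walk would also flag the sequential-JVM case, where the output is spread over several files and would need to be merged before analysis.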
TODO: Design a full sequence of work items for benchmarking the performance characteristics of applications, and carry it out for KNN and Hashjoin.