This post summarizes my research into how to accurately measure the memory bandwidth of a system using uncore counters. I believe this approach is more accurate than LLC_MISS * 64 bytes (the cache-line size), because the LLC_MISS counter does not include prefetch misses. The uncore counters can be accessed through perf, but this is not well documented.
The first approach, common wisdom in the area, is to compute LLC_MISS (last-level cache misses) * 64 bytes (the cache-line size). The problem with this approach is that the LLC_MISS counter does not include prefetch misses. This can be a huge issue when a lot of prefetching activity is involved (for example, when the program has streaming access patterns).
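To make the naive estimate concrete, here is a minimal sketch of the arithmetic; the miss count below is a made-up example value, and the commented perf invocation is one common way to obtain a real count:

```shell
# Naive estimate: traffic ~= LLC misses * 64 bytes (cache-line size).
# A real miss count would come from something like:
#   perf stat -e LLC-load-misses,LLC-store-misses -a -- sleep 1
# MISSES is a hypothetical example value, not a real measurement.
MISSES=1500000
CACHE_LINE=64
BYTES=$((MISSES * CACHE_LINE))
echo "Naive estimate: $BYTES bytes"   # undercounts lines brought in by the prefetcher
```

Any cache line brought in by the hardware prefetcher never registers as an LLC miss, which is exactly why this estimate falls short under streaming access.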
To find a better measure, I took the advice of Vlad from the COMMIT group at MIT and looked into a metric called MEMORY_BW_READS (memory bandwidth consumed by reads, expressed in bytes). MEMORY_BW_READS is an uncore IMC (integrated memory controller) derived metric, meaning it is computed by combining the counts from multiple hardware counters.
The document I found most useful is Intel's uncore performance monitoring reference manual for this processor family. The version I read is inserted here.
It has a section on IMC events. On page 64, Table 2-65 (a really useful table) gives MEM_BW_READS = CAS_COUNT.RD * 64 (the cache-line size).
CAS_COUNT.RD counts the cache lines read in for read operations.
CAS_COUNT.RD is a counter, and how to access it is documented in Table 2-64 on page 63 of the document.
The event code for CAS_COUNT is 0x4, but we still need to figure out what the .RD part is. Table 2-66 tells us that RD corresponds to umask 0b000011, which is 0x3.
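Putting the event code and umask together, perf expects a descriptor of the form uncore_imc_N/event=0x4,umask=0x3/. A small sketch that generates the event strings for the four IMC PMUs (indices 0, 1, 4, 5 match my machine; yours may differ, so check what appears under /sys/bus/event_source/devices/):

```shell
EVENT=0x4   # CAS_COUNT event code from Table 2-64
UMASK=0x3   # .RD umask (binary 000011) from Table 2-66
EVENTS=""
for imc in 0 1 4 5; do
  # Append one descriptor per IMC PMU, comma-separated
  EVENTS="${EVENTS:+$EVENTS,}uncore_imc_${imc}/event=${EVENT},umask=${UMASK}/"
done
echo "$EVENTS"
```

The resulting string is exactly the event list passed to perf stat in the command below.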
To measure the cache lines transferred through all four DDR3 channels, we use the -a flag so perf collects system-wide stats, and the --per-socket flag to aggregate the counts by socket.
Finally, we arrive at the following perf command:
perf stat -e uncore_imc_0/event=0x4,umask=0x3/,uncore_imc_1/event=0x4,umask=0x3/,uncore_imc_4/event=0x4,umask=0x3/,uncore_imc_5/event=0x4,umask=0x3/ -a --per-socket
This worked for me on an
Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz
CPU. The output looks like the following:
S0 1 47,393,161 uncore_imc_0/event=0x4,umask=0x3/
S0 1 74,551,419 uncore_imc_1/event=0x4,umask=0x3/
S0 1 44,469,596 uncore_imc_4/event=0x4,umask=0x3/
S0 1 75,110,641 uncore_imc_5/event=0x4,umask=0x3/
S1 1 168,373 uncore_imc_0/event=0x4,umask=0x3/
S1 1 30,317,419 uncore_imc_1/event=0x4,umask=0x3/
S1 1 147,860 uncore_imc_4/event=0x4,umask=0x3/
S1 1 30,403,979 uncore_imc_5/event=0x4,umask=0x3/
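To turn these counts into bandwidth, sum the four channels for a socket, multiply by 64 bytes, and divide by the measurement interval (perf stat prints the elapsed time at the end, not shown in the excerpt above). A sketch using the socket-0 counts from this output:

```shell
# Socket 0 CAS_COUNT.RD readings from the four IMC channels above
COUNTS="47393161 74551419 44469596 75110641"
TOTAL=0
for c in $COUNTS; do
  TOTAL=$((TOTAL + c))
done
BYTES=$((TOTAL * 64))   # 64 bytes per cache line
echo "Socket 0 read traffic: $BYTES bytes"
# Divide by the elapsed time reported by perf to get bytes/second.
```

Note this covers reads only; write bandwidth would come from the corresponding CAS_COUNT write umask in the same tables.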
Note: you DO NOT need root access, but you do need the administrator to set /proc/sys/kernel/perf_event_paranoid to -1 (for example, with sudo sysctl -w kernel.perf_event_paranoid=-1).
Otherwise, an error message shows up:
> You may not have permission to collect system-wide stats.
> Consider tweaking /proc/sys/kernel/perf_event_paranoid:
> -1 - Not paranoid at all
> 0 - Disallow raw tracepoint access for unpriv
> 1 - Disallow cpu events for unpriv
> 2 - Disallow kernel profiling for unpriv
Other tools for monitoring NUMA traffic
Apart from the uncore counters shown above, you can also use pcm.
Run ./pcm.x (this often requires sudo access).
If you get an error saying too many files are open, raise the limit with ulimit -n 10000 (pcm opens a file for each core).
Hope this helps in measuring memory bandwidth!