This note summarizes what I learned about using NUMACTL, taskset, and libnuma. These tools can be important for running parallel programs on NUMA architectures.
Quick Table of Contents
- Background in NUMA access
- NUMACTL notes
- Taskset notes
- Combine taskset and numactl (don’t do it!)
- lscpu (see which cores belong to which sockets)
- Libnuma notes
- Other tools for monitoring NUMA traffic
First, modern processors often take a NUMA (Non-Uniform Memory Access) approach to hardware design. This post gives the best introduction to NUMA through comparisons with UMA and some good graphs.
This post not only explains what NUMA is, but also the potential performance impact of migrating processes to different cores.
We sometimes want to control the way threads are assigned to cores, for reasons including (1) wanting to use hardware threads and avoid hyper threading, and (2) making sure a task doesn’t migrate between cores frequently.
To do this, Linux provides a command-line tool called NUMACTL. The documentation can be found here.
NUMACTL gives the ability to control
- NUMA scheduling policy
- for example, which cores do I want to run these tasks on
- Memory placement policy
- where to allocate data
To get started quickly, run
numactl --show
numactl --hardware
to check out the NUMA architecture of your system.
To understand the output, I followed this post: https://www.sharcnet.ca/help/index.php/Using_numactl
It is pretty helpful in explaining what the numbers mean. Here, I just want to note that if hyper threading is enabled, you might see 48 cores when there are actually just 24 physical cores. For more info, look up how to read /proc/cpuinfo
For me, this is the output on a compute node, which has 24 physical cores, 48 virtual cores, and 2 sockets:
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0 1
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 64398 MB
node 0 free: 43311 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64508 MB
node 1 free: 46774 MB
node 0 1
0: 10 21
1: 21 10
This shows that I have two memory groups (NUMA nodes), each with about 64GB of memory, and a total of 48 virtual cores. The table at the end lists the relative access distances between nodes: 10 for local memory and 21 for the remote node. As documented for --physcpubind, the option accepts CPU numbers as shown in the processor fields of /proc/cpuinfo. Thus, if hyper threading is enabled, the virtual cores that represent hyper threads, not just the hardware threads, show up as well.
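To make the node-to-CPU mapping concrete, here is a minimal Python sketch that parses the "node N cpus:" lines. It works on text copied from the output above rather than calling numactl, and the helper name parse_node_cpus is my own, not part of any library.

```python
import re

# Sample text copied from the `numactl --hardware` style output shown above.
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 64398 MB
node 0 free: 43311 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64508 MB
node 1 free: 46774 MB
"""


def parse_node_cpus(text):
    """Return {node_id: [cpu ids]} from the 'node N cpus:' lines."""
    nodes = {}
    for line in text.splitlines():
        match = re.match(r"node (\d+) cpus:\s*(.*)", line)
        if match:
            nodes[int(match.group(1))] = [int(c) for c in match.group(2).split()]
    return nodes


nodes = parse_node_cpus(SAMPLE)
for node, cpus in sorted(nodes.items()):
    print(f"node {node}: {len(cpus)} logical CPUs -> {cpus}")
print("total logical CPUs:", sum(len(c) for c in nodes.values()))
```

On a live system you could feed the function the text captured from the real command instead of SAMPLE.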
The man page http://linux.die.net/man/8/numactl already shows some examples of using numactl options such as --interleave and --physcpubind. I just want to go through a few additional examples.
One of them is from the Comp 422 class back at Rice University
- numactl --localalloc --physcpubind=0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124
This uses only one hardware thread per core and avoids the hyper threads. It works because, as noted on the page, each node on BIOU (the cluster this assignment is supposed to run on) has 128 threads but only 32 cores. This implies that each core has 4 threads, and we want to avoid using more than 1 thread per core. Thus, we use physcpubind to force the system to schedule tasks on only one hardware thread per core.
--localalloc is also useful because it makes sure each thread allocates memory on its own node.
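Typing a long CPU list by hand is error-prone, so here is a small Python sketch that generates it. It assumes the sibling threads of a core are numbered consecutively (0-3 on core 0, 4-7 on core 1, and so on), which is what the stride of 4 in the command implies. The function name and the ./program placeholder are hypothetical.

```python
def one_thread_per_core(num_threads, threads_per_core):
    """CPU ids that select one hardware thread per core.

    Assumption: the sibling threads of a core have consecutive ids,
    so taking every `threads_per_core`-th id lands on a new core.
    """
    return list(range(0, num_threads, threads_per_core))


cpus = one_thread_per_core(num_threads=128, threads_per_core=4)
cpu_list = ",".join(str(c) for c in cpus)

# ./program is a placeholder for whatever binary you want to run.
print(f"numactl --localalloc --physcpubind={cpu_list} ./program")
```

This numbering is machine specific. On my compute node the hyper threads of cores 0-23 are numbered 24-47 instead, so the same stride trick would not apply there.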
Another example, which I used myself to play with membind, is the following:
numactl --physcpubind=0 --membind=0 ./pagerankCsr.o mediumGraph
This binds memory allocation to the first memory group (socket 0 in my case). I measured the effect with the MEMORY_BW_READS event described in my previous post https://yunmingzhang.wordpress.com/2015/07/22/measure-memory-bandwidth-using-uncore-counters/
This command gives the following output
Performance counter stats for ‘system wide’:
S0 1 12,955,877 uncore_imc_0/event=0x4,umask=0x3/
S0 1 14,013,046 uncore_imc_1/event=0x4,umask=0x3/
S0 1 12,935,697 uncore_imc_4/event=0x4,umask=0x3/
S0 1 14,031,470 uncore_imc_5/event=0x4,umask=0x3/
S1 1 17,244 uncore_imc_0/event=0x4,umask=0x3/
S1 1 1,142,200 uncore_imc_1/event=0x4,umask=0x3/
S1 1 14,080 uncore_imc_4/event=0x4,umask=0x3/
S1 1 1,147,476 uncore_imc_5/event=0x4,umask=0x3/
Notice how most of the cache lines are brought in on socket 0 (the S0 rows).
numactl --physcpubind=0 --localalloc ./pagerankCsr.o mediumGraph
This command gives similar output, because we bind the process to CPU 0, and CPU 0 belongs to memory node 0 (socket 0). The output looks like the following
S0 1 12,978,178 uncore_imc_0/event=0x4,umask=0x3/
S0 1 14,037,751 uncore_imc_1/event=0x4,umask=0x3/
S0 1 12,953,229 uncore_imc_4/event=0x4,umask=0x3/
S0 1 14,052,675 uncore_imc_5/event=0x4,umask=0x3/
S1 1 16,075 uncore_imc_0/event=0x4,umask=0x3/
S1 1 1,144,211 uncore_imc_1/event=0x4,umask=0x3/
S1 1 13,020 uncore_imc_4/event=0x4,umask=0x3/
S1 1 1,148,139 uncore_imc_5/event=0x4,umask=0x3/
Now we try to force the program to allocate on memory node 1
numactl --physcpubind=0 --membind=1 ./pagerankCsr.o mediumGraph
This gives the following output
S0 1 97,904 uncore_imc_0/event=0x4,umask=0x3/
S0 1 1,966,565 uncore_imc_1/event=0x4,umask=0x3/
S0 1 88,200 uncore_imc_4/event=0x4,umask=0x3/
S0 1 1,984,923 uncore_imc_5/event=0x4,umask=0x3/
S1 1 12,906,166 uncore_imc_0/event=0x4,umask=0x3/
S1 1 14,713,862 uncore_imc_1/event=0x4,umask=0x3/
S1 1 12,885,845 uncore_imc_4/event=0x4,umask=0x3/
S1 1 14,715,551 uncore_imc_5/event=0x4,umask=0x3/
As we can see, most of the cache lines are now brought in from socket 1.
This scenario also demonstrates the importance of having the CPU thread allocate its data locally: the running time jumps from 2.08s to 3.47s (about a 1.7x slowdown), showing the penalty of accessing remote memory.
This matters even more for performance engineering of parallel programs.
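To put numbers on the counter tables, the following Python sketch sums the uncore read counts per socket for the --membind=0 and --membind=1 runs and computes the slowdown from the two running times quoted above. The counts are copied from the output listings; the dictionary layout and function name are just my own choices for illustration.

```python
# Uncore read counts copied from the --membind=0 and --membind=1 runs above.
RUNS = {
    "membind=0": {
        "S0": [12_955_877, 14_013_046, 12_935_697, 14_031_470],
        "S1": [17_244, 1_142_200, 14_080, 1_147_476],
    },
    "membind=1": {
        "S0": [97_904, 1_966_565, 88_200, 1_984_923],
        "S1": [12_906_166, 14_713_862, 12_885_845, 14_715_551],
    },
}


def socket_shares(counts):
    """Fraction of all counted reads served by each socket."""
    totals = {socket: sum(values) for socket, values in counts.items()}
    grand_total = sum(totals.values())
    return {socket: total / grand_total for socket, total in totals.items()}


for name, counts in RUNS.items():
    shares = socket_shares(counts)
    print(name, {socket: f"{share:.1%}" for socket, share in shares.items()})

# Running times quoted above: local allocation vs. remote allocation.
local_seconds, remote_seconds = 2.08, 3.47
print(f"slowdown: {remote_seconds / local_seconds:.2f}x")
```

In both runs more than 90% of the counted reads land on the socket named in --membind, which is the pattern the tables show by eye.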
Sometimes, you might want to free up memory after long-running NUMA programs. Use
numactl -H | grep free
to see how much free memory there is on each node. If it is low (which can happen if you have a lot of slab objects or page cache), try the following command (requires root privileges):
echo 3 > /proc/sys/vm/drop_caches
In my experience, this helps free unused caches and keeps performance stable after a series of memory-consuming operations.
Documentation for drop_caches can be found here
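If you want to check the free memory in a script rather than by eye, a sketch like the one below computes the free fraction per node. It parses sample lines copied from the numactl output earlier in this note; the 10% threshold is a hypothetical value I picked for illustration, not a recommendation.

```python
import re

# Size/free lines copied from the numactl output shown earlier in this note.
SAMPLE = """\
node 0 size: 64398 MB
node 0 free: 43311 MB
node 1 size: 64508 MB
node 1 free: 46774 MB
"""

LOW_WATERMARK = 0.10  # hypothetical threshold for "low" free memory


def free_fraction(text):
    """Return {node: free_MB / size_MB} from 'node N size/free' lines."""
    stats = {}
    for node, key, megabytes in re.findall(r"node (\d+) (size|free): (\d+) MB", text):
        stats.setdefault(int(node), {})[key] = int(megabytes)
    return {node: s["free"] / s["size"] for node, s in stats.items()}


for node, fraction in sorted(free_fraction(SAMPLE).items()):
    status = "LOW" if fraction < LOW_WATERMARK else "ok"
    print(f"node {node}: {fraction:.0%} free ({status})")
```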
To run on one socket while allocating memory on the other, you can use
numactl -N 0 -m 1 ./test (use all the cores in socket 0, but allocate memory in socket 1)
numactl --physcpubind=0 --membind=1 ./test (use only 1 core in socket 0, but allocate memory in socket 1)
You can use lscpu to see how the core IDs correspond to sockets, cores, and hyperthreads (explained in greater detail below).
Other Tools for NUMA access control
There are two major tools out there for controlling NUMA access through command line (bash script). In this article, we focus on NUMACTL, but in practice, “taskset” is also often used.
Check out the documentation here
For example, I often use this for making sure a program runs on a single core
taskset -c 0 ./executable args…
You can also use it to avoid hyperthreads and keep the program on the cores of a single socket.
taskset -c 0-11 ./executable args
(on my machine this binds it to the 12 physical cores of NUMA node 0; their hyper threads are 24-35, while 12-23 are the physical cores of node 1)
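The same kind of CPU pinning can be done from inside a program. As a rough illustration (this is not taskset or libnuma, and it sets CPU affinity only, not memory policy), Python exposes the Linux affinity call as os.sched_setaffinity. The helper name below is my own, and the sketch returns None on platforms where the call is unavailable.

```python
import os


def pin_current_process(cpus):
    """Restrict the calling process to `cpus`, similar to `taskset -c`.

    Returns the resulting affinity set, or None on platforms where
    os.sched_setaffinity is not available (it is Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return os.sched_getaffinity(0)


# Pick a CPU we are currently allowed to run on, so the call cannot
# fail just because CPU 0 happens to be excluded by the scheduler.
if hasattr(os, "sched_getaffinity"):
    first_cpu = min(os.sched_getaffinity(0))
else:
    first_cpu = 0

print("affinity after pinning:", pin_current_process({first_cpu}))
```

Child processes inherit the affinity, so pinning early and then launching the workload behaves much like wrapping it in taskset.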
Combine taskset and numactl -i all
(taskset -c 0-x numactl -i all ..) This is not a good idea; it gave me weird performance results.
If you are trying to measure performance scalability, it is probably best to just use taskset alone.
To figure out which numbers correspond to which cores, the easiest way is to use
“lscpu”, which gives the following information (among much else)
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
This shows that 0-11 and 24-35 are the 24 hardware threads of the 12 cores in NUMA node 0. 24-35 can be thought of as the hyper threads.
“taskset -c 0-11 command” uses the 12 cores in socket 0 (NUMA node 0) without using their hyper threads.
“taskset -c 0-11,24-35 command” uses the 12 cores in socket 0 (NUMA node 0) along with their hyper threads.
“taskset -c 0-23 command” uses the 24 cores in both sockets (NUMA node 0, NUMA node 1) without using their hyper threads.
“taskset -c 0-47 command” uses the 24 cores in both sockets (NUMA node 0, NUMA node 1) with their hyper threads. This should be similar to “numactl -i all” in terms of CPU placement, although numactl -i all additionally interleaves memory pages across the nodes.
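The range notation in the lscpu output can be expanded programmatically when building taskset arguments. The Python sketch below does that for the two NUMA node lines shown above and splits each node into first hardware threads and hyper threads. The split relies on an assumption specific to my machine: ids below 24 are the first thread of each physical core. The function name is hypothetical.

```python
def expand_cpu_list(spec):
    """Expand an lscpu-style list such as '0-11,24-35' into CPU ids."""
    cpus = []
    for part in spec.split(","):
        low, _, high = part.partition("-")
        # A single id like '5' has no '-', so `high` is empty; reuse `low`.
        cpus.extend(range(int(low), int(high or low) + 1))
    return cpus


# NUMA node lines copied from the lscpu output above.
NODE_CPUS = {0: "0-11,24-35", 1: "12-23,36-47"}

# Assumption for this machine: ids 0-23 are the first thread of each core,
# ids 24-47 are their hyper thread siblings.
PHYSICAL_CORES = 24

for node, spec in NODE_CPUS.items():
    cpus = expand_cpu_list(spec)
    first_threads = [c for c in cpus if c < PHYSICAL_CORES]
    hyper_threads = [c for c in cpus if c >= PHYSICAL_CORES]
    print(f"node {node}: first threads {first_threads}, hyper threads {hyper_threads}")
```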
Hope this tutorial helps!
You can also bind threads from within the code using libnuma.
Documentation can be found here
and another good documentation here
A more complete set of notes on both numactl and libnuma from Andi Kleen
Other tools for monitoring NUMA traffic
Apart from the uncore counters we showed earlier, you can also use pcm:
run ./pcm.x (often requires sudo access).
If you get a “too many open files” error, raise the limit with ulimit -n 10000 (pcm opens a file for each core).