Process/thread pinning (WIP)
Overview
In modern machines, memory access from a processor core to main memory is not uniform: the machine's memory is split into separate regions called “NUMA domains”. Therefore, for optimal performance, each core should preferentially access memory in its nearest NUMA domain.
Process/thread pinning, also called CPU affinity, refers to binding or “pinning” specific workload processes or threads to specific CPU cores. The goal is to optimize performance by improving cache utilization, using resources evenly, and ensuring consistent execution times. On Non-Uniform Memory Access (NUMA) architectures, pinning processes/threads to specific cores helps ensure that memory accesses are local to the processor, reducing latency. It also helps maximize the use of the L3 cache.
MPI (Message Passing Interface) implementations include support for process affinity. For instance, OpenMPI, IntelMPI, HPC-X, and MPICH allow users to specify CPU binding policies through environment variables or command-line options. SLURM and PBS also allow such specification when starting user jobs.
In hybrid parallel applications, that is, those mixing distributed and shared memory (e.g. MPI and OpenMP), each process has multiple threads, so handling the mapping of threads so that they share the same L3 cache is beneficial.
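As a minimal sketch of such a hybrid launch (assuming Open MPI 4.x, a node with 4 NUMA domains, an OpenMP-threaded application, and the placeholder ./your_program), one MPI rank can be placed per NUMA node with 8 cores reserved per rank, and the OpenMP threads pinned to those cores:
$ export OMP_NUM_THREADS=8      # threads per MPI rank
$ export OMP_PLACES=cores       # one OpenMP place per core
$ export OMP_PROC_BIND=close    # keep threads on neighboring cores
$ mpirun -np 4 --map-by ppr:1:numa:PE=8 --bind-to core ./your_program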
A bit on OpenMPI pinning flags
Depending on the MPI implementation (for instance, OpenMPI), there are flags to specify how the pinning should take place.
Here, on page 84 onwards, you will find examples for the OpenMPI --map-by, --rank-by and --bind-to options. The following reference also has an interesting description of this topic: [LINK].
Terminology:
NUMA node or NUMA domain
- group of one or more CPUs (or cores) and the memory that is local to those CPUs.
Socket
- the physical connector on the motherboard that holds the CPU, which may contain multiple CPU cores and multiple NUMA nodes.
Core Complex Die (CCD)
- physical grouping of CPU cores within a single piece of silicon, commonly found in modern AMD processors.
Core CompleX (CCX)
- basic building block of the processor, i.e., a group of CPU cores sharing an L3 cache within a Core Complex Die (CCD).
Slots
- How many processes allowed on a node
- Oversubscribe: #processes > #slots
Processing element (“PE”)
- Smallest schedulable processing unit
- Frequently a core or hardware thread (HT)
- Overload: more than one process bound to a PE
Mapping
- Assign each process to a location
- Determines which processes run on what nodes
- Detects oversubscription
Ranking
- Assigns MPI_COMM_WORLD rank to each process
Binding
- Constrain a process to execute on specific resources on a node.
- Processes can be bound to any resource on a node, including NUMA regions, sockets, and caches.
- Binds processes to specific processing elements
- Detects overload (see the combined example after this list)
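As a hedged illustration of how mapping, ranking, and binding come together in a single launch (a sketch assuming Open MPI 4.x and the placeholder ./your_program), the command below maps 8 ranks round-robin across NUMA nodes, numbers the ranks by core, binds each rank to a single core, and prints the resulting bindings:
mpirun -np 8 --map-by numa --rank-by core --bind-to core --report-bindings ./your_program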
The following document contains an overview of the relationship between CCD and CCX (4TH GEN AMD EPYC™ PROCESSOR ARCHITECTURE):
https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf
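On Linux you can also check which cores share an L3 cache (i.e. belong to the same CCX) through sysfs; a minimal sketch, assuming that index3 corresponds to the L3 cache on your system:
$ cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list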
In OpenMPI:
--map-by: Specifies the policy for mapping processes to hardware resources.
- slot: Use Open MPI’s default mapping.
- node: Map processes by node.
- socket: Map processes by socket.
- core: Map processes by core.
- numa: Map processes by NUMA node.
Example: mpirun --map-by socket ./your_program
--bind-to: Specifies the policy for binding processes to CPUs.
- none: No binding.
- core: Bind to individual cores.
- socket: Bind to sockets.
- numa: Bind to NUMA nodes.
- board: Bind to entire boards.
Example: mpirun --bind-to core ./your_program
--report-bindings: Reports the binding of each process when the job is launched.
In summary, --map-by controls where processes are initially placed (i.e. it assigns each process to a hardware component), whereas --bind-to controls how processes are pinned to specific hardware resources once placed (i.e. it restricts where a process may run within the mapped hardware).
If you want to test the process pinning without running your application, you can use something simpler, such as the hostname command:
mpirun --report-bindings -np 88 --bind-to core --map-by ppr:2:socket hostname
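You can also inspect the CPU affinity of an already-running process from the shell; a minimal sketch, where 12345 stands for a hypothetical PID:
$ taskset -cp 12345                          # prints the affinity as a CPU list
$ grep Cpus_allowed_list /proc/12345/status  # same information from procfs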
Check compute node information
The numactl tool shows NUMA-related information for a machine and controls its NUMA policy. With the --hardware (or -H) flag you can inspect the NUMA topology:
numactl --hardware
Here is an example of the output for an Azure HBv3 machine (which has 2 sockets and 4 NUMA nodes, i.e. 2 NUMA nodes per socket):
$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
node 0 size: 114839 MB
node 0 free: 113693 MB
node 1 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 1 size: 114905 MB
node 1 free: 113538 MB
node 2 cpus: 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 2 size: 114863 MB
node 2 free: 112997 MB
node 3 cpus: 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 114904 MB
node 3 free: 112898 MB
node distances:
node 0 1 2 3
0: 10 12 32 32
1: 12 10 32 32
2: 32 32 10 12
3: 32 32 12 10
The memory latency distance between a node and itself is normalized to 10 (1.0x). So, for example, the distance between NUMA node 0 and node 2 is 32, i.e. roughly 3.2x the local access latency.
The distances can also be obtained from sysfs:
$ cat /sys/devices/system/node/node*/distance
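Beyond inspecting the topology, numactl can also launch a program with its CPUs and memory restricted to a given NUMA node; a minimal sketch, using the placeholder ./your_program:
$ numactl --cpunodebind=0 --membind=0 ./your_program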
If you want measured core-to-core latencies, you can use the tool below:
https://github.com/vgamayunov/c2clat
Here is an example of c2clat output for HBv3. The figure was generated using gnuplot, but the data from c2clat can also be imported into Excel for visualization.
You can get the number of sockets:
$ lscpu | grep 'Socket(s):'
Socket(s): 2
You can see the socket id for each NUMA domain/node:
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0 0 0 0 0:0:0:0 yes
1 0 0 1 1:1:1:0 yes
2 0 0 2 2:2:2:0 yes
3 0 0 3 3:3:3:0 yes
...
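A more compact view of which CPUs belong to each NUMA node is also available directly from lscpu; a minimal sketch:
$ lscpu | grep -i 'NUMA'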
You can also use hwloc-ls and lstopo, which provide text and graphical representations of the machine topology, NUMA nodes, cache info, and processor mapping. You can also get some processor info using cat /proc/cpuinfo.
The lstopo tool comes from the hwloc package. The output can be text only, or you can generate an image; for image support you may need to install the hwloc-gui package. For example, to generate an image for an HBv3 machine, run:
lstopo --output-format png --no-legend --no-io output.png
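The hwloc package also provides hwloc-bind, which launches a command bound to a chosen topology object; a minimal sketch, binding the placeholder ./your_program to the first four cores:
$ hwloc-bind core:0-3 -- ./your_program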
Pinning on Azure HPC VMs
The following blog describes the Azure HB Virtual Machine (VM) series and the importance of properly mapping processes/threads onto the cores of the VM's processors. The blog post also provides an overview of process placement when undersubscribing a VM (i.e. using fewer cores than it has). For instance, HBv3 has 120 AMD cores; if one only needs 16 of them, Azure offers the constrained HB120-16rs_v3 size, or the user can take care of the process placement themselves by following the instructions in the blog.
There is another blog, listed in the References below, which describes a Python tool that assists with process pinning on Azure HPC SKUs.
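As a hedged illustration of undersubscribing an HBv3 VM (a sketch assuming Open MPI and the placeholder ./your_program), the command below launches 16 ranks spread evenly across the 4 NUMA domains shown earlier, 4 ranks per domain, each bound to its own core:
mpirun -np 16 --map-by ppr:4:numa --bind-to core --report-bindings ./your_program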
References
- HPCWiki - Binding/Pinning
- blog: Optimal MPI Process Placement for Azure HB Series VMs
- blog: Tool to assist in optimal pinning of processes/threads for Azure HPC/AI VM’s
- CPU pinning script 1 - azure hpc benchmarking
- CPU pinning script 2 - azure hpc experimental
- CPU pinning script 3
- examples on OpenMPI --map-by, --rank-by and --bind-to (page 84 onwards)
- OpenMPI docs
- CW2020 S4V1 - Process and Thread Affinity with MPI/OpenMP - Yun He
- 4TH GEN AMD EPYC™ PROCESSOR ARCHITECTURE
- c2clat tool - core to core latency
- lstopo tool (from hwloc)
- discussion on docs about OpenMPI --map-by, --rank-by, and --bind-to
- OpenMPI Wiki - Process Placement
- Binding/Mapping/Ranking Page 66
- PETSc discussion on mpi process mapping
- blog: Performance & Scalability of HBv4 and HX-Series VMs with Genoa-X CPUs (Jun 2023)
- blog: Performance of Azure HBv4 and HX VMs for HPC (Nov 2022)
- HBv4 overview
- HBv3 overview
- Azure HPC apps setup git repo
- Compiling/scaling HPC apps overview tips
- Choosing MPI in Azure overview
- Azure HPC images - release notes