ppr stands for "processes per resource". Its syntax is ppr:N:resource, which means "assign N processes to each resource of the given type available on the host." For example, on a four-socket system with 6-core processors, --map-by ppr:4:socket produces the following process map:
socket   ---- 0 ----   ---- 1 ----   ---- 2 ----   ---- 3 ----
core     0 1 2 3 4 5   0 1 2 3 4 5   0 1 2 3 4 5   0 1 2 3 4 5
process  A B C D       E F G H       I J K L       M N O P
(in these examples, processes are labeled with letters from A to Z)
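A command line requesting this layout might look like the following sketch (./my_app is a placeholder application; --report-bindings is an Open MPI option that prints where each process ends up, so you can verify the resulting map):

    $ mpirun --map-by ppr:4:socket --report-bindings ./my_app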
The manual notes that the whole ppr:N:resource construct should be treated as a single specifier and that additional parameters separated by : can be appended to it, for example ppr:2:socket:pe=2. This reads as "run two processes on each socket and bind each of them to two processing elements" and, on the same four-socket system, produces the following map:
socket   ---- 0 ----   ---- 1 ----   ---- 2 ----   ---- 3 ----
core     0 1 2 3 4 5   0 1 2 3 4 5   0 1 2 3 4 5   0 1 2 3 4 5
process  A A B B       C C D D       E E F F       G G H H
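The pe variant can be requested in the same way (again a sketch, with ./my_app as a placeholder); each of the eight processes should then end up bound to two cores:

    $ mpirun --map-by ppr:2:socket:pe=2 --report-bindings ./my_app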
The sequential mapper (seq) reads the host file line by line and starts one process on each host it finds. It ignores the number of slots, if specified.
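As a sketch of its use (the host names and file name here are made up), a host file listing hosts one per line is passed together with --map-by seq; one process should then be started per line, in the order the lines appear:

    $ cat myhosts
    node01
    node02
    node01
    $ mpirun --hostfile myhosts --map-by seq ./my_app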
The dist mapper places processes on NUMA nodes according to their distance from a given PCI device. This only makes sense on NUMA systems. Again, let me use the toy quad-socket system, but this time expand the view to show the NUMA topology:
Socket 0 ------------- Socket 1
   |                      |
   |                      |
   |                      |
   |                      |
   |                      |
Socket 2 ------------- Socket 3
   |
  ib0
Lines between sockets are inter-CPU links, for example QPI links for Intel processors and HyperTransport links for AMD processors. ib0 is an InfiniBand HCA used to communicate with other compute nodes. On this system, Socket 2 talks directly to the InfiniBand HCA, Socket 0 and Socket 3 must cross one inter-CPU link to reach ib0, and Socket 1 must cross two links. This means that processes running on Socket 2 will have the lowest possible latency when sending and receiving messages, while processes on Socket 1 will have the highest.
How does this work? If your host file specifies, for example, 16 slots on this host and you map with --map-by dist:ib0, you may get the following map:
socket   ---- 0 ----   ---- 1 ----   ---- 2 ----   ---- 3 ----
core     0 1 2 3 4 5   0 1 2 3 4 5   0 1 2 3 4 5   0 1 2 3 4 5
process  G H I J K L                 A B C D E F   M N O P
Six processes are mapped to Socket 2, which is closest to the InfiniBand HCA, then six more are mapped to Socket 0, which is the second closest, and the remaining four are mapped to Socket 3. It is also possible to spread processes across the sockets instead of filling processing elements linearly: --map-by dist:ib0:span results in:
socket   ---- 0 ----   ---- 1 ----   ---- 2 ----   ---- 3 ----
core     0 1 2 3 4 5   0 1 2 3 4 5   0 1 2 3 4 5   0 1 2 3 4 5
process  E F G H       M N O P       A B C D       I J K L
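Put together, the two maps above would be requested with commands along these lines (./my_app is a placeholder, and the device name ib0 must match what hwloc reports on your node):

    $ mpirun -np 16 --map-by dist:ib0 --report-bindings ./my_app
    $ mpirun -np 16 --map-by dist:ib0:span --report-bindings ./my_app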
The actual NUMA topology is obtained using the hwloc library, which reads the distance information provided by the BIOS. hwloc includes a command-line tool called hwloc-ls (also known as lstopo), which can be used to display the topology of the system. By default its output only includes the processor topology and NUMA domains, but with the -v option it also includes PCI devices.
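So, to find out which device names are available to pass to dist: on a given node, you can run the tool in verbose mode and look for your HCA (ib0 in the example above):

    $ hwloc-ls -v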