--map-by option syntax in Open MPI mpirun v1.8

Take a look at the following excerpt from the Open MPI manual:

--map-by <foo> Map to the specified object, defaults to socket. Supported options include slot, hwthread, core, L1cache, L2cache, L3cache, socket, numa, board, node, sequential, distance, and ppr. Any object can include modifiers by adding a : and any combination of PE=n (bind n processing elements to each proc), SPAN (load balance the processes across the allocation), OVERSUBSCRIBE (allow more processes on a node than processing elements), and NOOVERSUBSCRIBE. This includes PPR, where the pattern would be terminated by another colon to separate it from the modifiers. 

I have several questions regarding the syntax, and some comments on them:

  • What do the options sequential, distance, and ppr do?

ppr especially puzzles me. What does this abbreviation stand for?

  • How should I understand an option such as --map-by ppr:4:socket in terms of the excerpt from the manual?

Of course, I can see the result of the example above by looking at the bindings reported with --report-bindings (only 4 processes are mapped to one socket and are bound to 4 cores of the same socket by default), but I cannot make sense of the syntax. Another line of the manual says that this new option replaces the deprecated --npersocket option:

 -npersocket, --npersocket <#persocket> On each node, launch this many processes times the number of processor sockets on the node. The -npersocket option also turns on the -bind-to-socket option. (deprecated in favor of --map-by ppr:n:socket)
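
If I understand correctly, that would make the two command lines below equivalent (./my_app and the count of 4 per socket are just placeholders of mine):

  # deprecated form: 4 processes per socket, also turns on binding to sockets
  $ mpirun -npersocket 4 ./my_app
  # replacement spelled with the new --map-by option
  $ mpirun --map-by ppr:4:socket ./my_app
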
1 answer

ppr stands for processes per resource. Its syntax is ppr:N:resource, and it means "assign N processes to each resource of the given type available on the host." For example, on a four-socket system with 6-core CPUs, --map-by ppr:4:socket produces the following process map:

  socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
  core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
  process  A B C D        E F G H        I J K L        M N O P

(process numbering runs from A to Z in this example)
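
As a sketch, the command line that would produce this map (./my_app is a placeholder; with ppr the total rank count follows from the resource count, here 4 per socket x 4 sockets = 16, so no -np is needed):

  $ mpirun --map-by ppr:4:socket --report-bindings ./my_app
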

What the manual means is that the whole ppr:N:resource should be treated as a single specifier and that modifiers separated by : can be appended to it, e.g. ppr:2:socket:pe=2. This reads as "start two processes per socket and bind each of them to two processing elements" and would result in the following map on the same quad-socket system:

  socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
  core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
  process  A A B B        C C D D        E E F F        G G H H
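
Again as a sketch (./my_app is a placeholder), the corresponding command line would be:

  $ mpirun --map-by ppr:2:socket:pe=2 --report-bindings ./my_app
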

The sequential mapper reads the host file line by line and starts one process per host found there. It ignores the slot count, if one is given.
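
For illustration, a sketch with a made-up host file (the file name and host names are assumptions of mine):

  $ cat hostfile.txt
  node01 slots=8
  node02 slots=8
  # one process per line of the host file; the slots=8 counts are ignored
  $ mpirun --hostfile hostfile.txt --map-by sequential ./my_app
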

The dist mapper places processes on NUMA nodes according to the distance of the latter from a given PCI resource. It only makes sense on NUMA systems. Again, let me use the toy quad-socket system, but this time extend the picture to show the NUMA topology:

  Socket 0 ------------- Socket 1
     |                      |
     |                      |
     |                      |
  Socket 2 ------------- Socket 3
     |
    ib0

The lines between the sockets are CPU links, e.g. QPI links for Intel CPUs and HT links for AMD CPUs. ib0 is an InfiniBand HCA used to communicate with other compute nodes. Now, on this system, Socket 2 talks directly to the InfiniBand HCA. Socket 0 and Socket 3 have to cross one CPU link to talk to ib0, and Socket 1 has to cross two CPU links. This means that processes running on Socket 2 will have the lowest possible latency while sending and receiving messages, and processes on Socket 1 will have the highest possible latency.

How does it work? If your host file specifies, for example, 16 slots on this host and the mapping option is --map-by dist:ib0, it may result in the following map:

  socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
  core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
  process  G H I J K L                   A B C D E F    M N O P

6 processes are mapped to Socket 2, which is closest to the InfiniBand HCA, then 6 more are mapped to Socket 0, which is the second closest, and 4 more are mapped to Socket 3. It is also possible to spread the processes out instead of linearly filling the processing elements. --map-by dist:ib0:span results in:

  socket   ---- 0 ----    ---- 1 ----    ---- 2 ----    ---- 3 ----
  core     0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5    0 1 2 3 4 5
  process  E F G H        M N O P        A B C D        I J K L
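
The corresponding command lines would look roughly like this (a sketch; ib0 is the device from the topology above and ./my_app is a placeholder):

  # fill processing elements in order of increasing distance from ib0
  $ mpirun -np 16 --map-by dist:ib0 --report-bindings ./my_app
  # same mapping, but spread the ranks across the NUMA nodes
  $ mpirun -np 16 --map-by dist:ib0:span --report-bindings ./my_app
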

The actual NUMA topology is obtained using the hwloc library, which reads the distance information provided by the BIOS. hwloc includes a command-line tool called hwloc-ls (also known as lstopo) that can be used to display the topology of the system. Normally it only includes the processing elements and NUMA domains in its output, but if you give it the -v option, it also includes PCI devices.
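
For example (the output is machine-specific):

  # processing elements and NUMA domains only
  $ hwloc-ls
  # verbose output additionally lists PCI devices such as the InfiniBand HCA
  $ hwloc-ls -v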
