Identifying hyper-threaded core IDs in Linux

This morning I tried to figure out how to determine which processor IDs belong to hyper-threaded cores, but without any luck.

I want to find out this information and use set_affinity() to bind a process either to a hyper-threaded sibling or to a thread on a core of its own, to compare performance.

+10
6 answers

I discovered a simple trick to do what I need.

 cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list 

If the first number equals the CPU number (0 in this example), then this is a real core; otherwise it is a hyper-threaded sibling.

Physical-core example:

 # cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list 1,13 

Hyper-threaded core example:

 # cat /sys/devices/system/cpu/cpu13/topology/thread_siblings_list 1,13 

The output of the second example is exactly the same as the first. However, we are checking cpu13, and the first number is 1, so CPU 13 is a hyper-threaded core.
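
The trick above can be scripted to classify every CPU on the system. This is a minimal sketch; the helper function `is_primary_thread` is my own, not a standard tool, and it assumes the sibling list format `a,b` or `a-b` with the lowest-numbered sibling first.

```shell
#!/bin/sh
# Classify each logical CPU by comparing its number with the first
# entry of its thread_siblings_list, as described above.
is_primary_thread() {
    cpu="$1"
    siblings="$2"              # e.g. "1,13" or "0-1"
    first=${siblings%%[,-]*}   # keep the text before the first ',' or '-'
    [ "$first" = "$cpu" ]
}

for dir in /sys/devices/system/cpu/cpu[0-9]*; do
    [ -r "$dir/topology/thread_siblings_list" ] || continue
    cpu=${dir##*cpu}
    siblings=$(cat "$dir/topology/thread_siblings_list")
    if is_primary_thread "$cpu" "$siblings"; then
        echo "cpu$cpu: physical core (siblings: $siblings)"
    else
        echo "cpu$cpu: hyper-threaded sibling (siblings: $siblings)"
    fi
done
```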

+30

HT is symmetric in terms of core resources (the OS-visible enumeration may be asymmetric).

So, if HT is enabled, the physical core's major resources are shared between the two threads, with some additional hardware added to hold the state of both threads. Both threads have symmetric access to the physical core.

There is a difference between a core with HT disabled and a core with HT enabled; but there is no difference between the "first half" and the "second half" of an HT-enabled core.

At any given moment one HT thread may be using more resources than the other, but this balancing is dynamic: the CPU rebalances the threads as far as possible, since it wants both threads to make progress. About all you can do is execute rep nop (pause) in one thread to let the CPU give more resources to the other thread.

I want to find out this information and use set_affinity() to bind a process to a thread with a hyperthread or a thread without a hyperthread to profile its performance.

Well, you can measure performance without knowing which is which. Just run the profile with the only busy thread in the system pinned to CPU0; then repeat with it pinned to CPU1. I think the results will be almost the same (the OS can add noise if it routes some interrupts to CPU0, so try to reduce interrupt load during testing, and try CPU2 and CPU3 as well, if present).
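
The pin-and-compare experiment can be done from the shell with taskset (from util-linux); no code changes to the workload are needed. A sketch, where `run_pinned` is my helper and `echo` stands in for the real benchmark:

```shell
#!/bin/sh
# Run a command pinned to one logical CPU.
run_pinned() {
    cpu="$1"; shift
    taskset -c "$cpu" "$@"     # taskset is part of util-linux
}

# Compare CPU0 vs CPU1 (wrap the command with time(1) or your profiler):
run_pinned 0 echo "ran pinned to cpu0"
if [ "$(getconf _NPROCESSORS_ONLN)" -gt 1 ]; then
    run_pinned 1 echo "ran pinned to cpu1"
fi
```

The same binding is available programmatically via sched_setaffinity(2) if you prefer to do it inside the process.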

PS

Agner Fog (the x86 guru) recommends using even-numbered logical processors in case you do not want to use HT but it is enabled in the BIOS:

If hyperthreading is detected, lock the process to use only even-numbered logical processors. This makes one of the two threads in each processor core idle so that there is no contention for resources.
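
A sketch of that advice in shell, assuming the even/odd sibling layout Agner describes (which is not guaranteed on every system; verify with lscpu or hwloc first). The helper `even_cpus` is mine, not a standard tool:

```shell
#!/bin/sh
# Build a comma-separated list of even-numbered logical CPUs.
even_cpus() {                  # even_cpus <ncpus> -> "0,2,4,..."
    n="$1"; out=""; i=0
    while [ "$i" -lt "$n" ]; do
        out="${out:+$out,}$i"
        i=$((i + 2))
    done
    echo "$out"
}

list=$(even_cpus "$(getconf _NPROCESSORS_ONLN)")
echo "even logical CPUs: $list"
# taskset -c "$list" ./your_app   # placeholder for the real workload
```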

PPS On the newer HT incarnation (not P4, but Nehalem and Sandy Bridge), based on Agner's microarchitecture study:

The new bottlenecks that require attention in Sandy Bridge are: ... 5. Resource sharing between threads. Many of the critical resources are shared between the two threads of a core when hyperthreading is on. It may be wise to turn off hyperthreading when multiple threads depend on the same execution resources.

...

A half-way solution was introduced in NetBurst and again in Nehalem and Sandy Bridge with what is called hyperthreading. A hyperthreading processor has two logical processors sharing the same execution core. The advantage of this is limited if the two threads compete for the same resources, but hyperthreading can be quite advantageous if performance is limited by something else, such as memory access.

...

Both Intel and AMD are making hybrid solutions in which some or all of the execution units are shared between two processor cores (hyperthreading in Intel terminology).

PPPS: The Intel Optimization Manual lists the resource-sharing policies of the second generation of HT (page 93; the list is for Nehalem, but the Sandy Bridge section makes no change to it):

Deeper buffering and enhanced resource sharing/partition policies:

  • Replicated resources for HT operation: register state, renamed return stack buffer, large-page ITLB. // my comment: there are two sets of this hardware
  • Partitioned resources for HT operation: load buffers, store buffers, re-order buffers, small-page ITLB are statically allocated between the two logical processors. // my comment: there is one set of this hardware; it is statically split in half between the two HT logical cores
  • Competitively shared resources during HT operation: the reservation station, cache hierarchy, fill buffers, both DTLB0 and STLB. // my comment: a single set, not split in half; the CPU dynamically reallocates these resources
  • Alternating during HT operation: front-end operations generally alternate between the two logical processors to ensure fairness. // my comment: there is one front end (instruction decoder), so the threads are decoded alternately: 1, 2, 1, 2.
  • HT-unaware resources: execution units. // my comment: these are the actual hardware units that perform computations and memory accesses. There is only one set. If one thread can keep many execution units busy and rarely waits on memory, it will consume most of the execution units and the performance of the second thread will be low (though HT still periodically switches to the second thread; how often?). If both threads are not throughput-optimized and/or wait on memory, the execution units will be shared between the two threads.

There is also a figure on page 112 (Figure 2-13) showing that both logical cores are symmetric.

The performance potential of HT technology is due to:

  • The fact that operating systems and user programs can schedule processes or threads to run simultaneously on the logical processors in each physical processor
  • The ability to use on-chip resources at a higher rate than when only one thread consumes the execution resources; higher resource utilization can lead to higher system throughput.

Although instructions originating from two programs or two threads execute simultaneously, and not necessarily in program order, in the execution core and memory hierarchy, the front end and back end contain several selection points to choose between instructions from the two logical processors. All selection points alternate between the two logical processors unless one logical processor cannot make use of a pipeline stage; in that case the other logical processor gets full use of every cycle of that pipeline stage. Reasons why a logical processor may not use a pipeline stage include cache misses, branch mispredictions, and instruction dependencies.

+11

I am surprised that no one has mentioned lscpu . Here is an example on a single-socket system with four physical cores and hyper-threading:

 $ lscpu -p
 # The following is the parsable format, which can be fed to other
 # programs. Each different item in every column has an unique ID
 # starting from zero.
 # CPU,Core,Socket,Node,,L1d,L1i,L2,L3
 0,0,0,0,,0,0,0,0
 1,1,0,0,,1,1,1,0
 2,2,0,0,,2,2,2,0
 3,3,0,0,,3,3,3,0
 4,0,0,0,,0,0,0,0
 5,1,0,0,,1,1,1,0
 6,2,0,0,,2,2,2,0
 7,3,0,0,,3,3,3,0

The header explains how to interpret the table: logical CPU IDs with the same core ID are siblings.
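
That interpretation can be automated: group the parsable lscpu output by (socket, core) and print the siblings of each core. A sketch; `group_siblings` is my helper, not part of util-linux:

```shell
#!/bin/sh
# Read "CPU,Core,Socket" lines on stdin; print one line per physical core
# listing the logical CPUs (hyper-thread siblings) that belong to it.
group_siblings() {
    awk -F, '!/^#/ { key = $3 "." $2; sib[key] = sib[key] " " $1 }
             END { for (k in sib) print "core " k ":" sib[k] }'
}

# On a live system:
if command -v lscpu >/dev/null 2>&1; then
    lscpu -p=CPU,CORE,SOCKET | group_siblings | sort
fi
```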

+7

There is a universal (Linux/Windows) and portable hardware topology detector (cores, HTs, caches, NUMA nodes, I/O devices): hwloc, from the OpenMPI project. It is worth using because Linux can apply different numbering rules to the cores, and we cannot know in advance whether the rule will be even/odd or y and y+8 numbering.

Hwloc homepage: http://www.open-mpi.org/projects/hwloc/

Download page: http://www.open-mpi.org/software/hwloc/v1.10/

Description:

The Portable Hardware Locality (hwloc) software package provides a portable abstraction (across OS, version, architecture, ...) of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information, as well as the locality of I/O devices such as network interfaces, InfiniBand HCAs, or GPUs. It primarily aims at helping applications gather information about modern computing hardware so as to exploit it accordingly and efficiently.

It provides the lstopo command to display the hardware topology graphically, for example:

  ubuntu$ sudo apt-get install hwloc
  ubuntu$ lstopo

[image: example of lstopo graphical output from hwloc (OpenMPI)]

or in text form:

  ubuntu$ sudo apt-get install hwloc-nox
  ubuntu$ lstopo --of console

We can see the physical cores as Core L#x , each of which has two logical cores PU L#y and PU L#y+8 .

 Machine (16GB)
   Socket L#0 + L3 L#0 (4096KB)
     L2 L#0 (1024KB) + L1 L#0 (16KB) + Core L#0
       PU L#0 (P#0)
       PU L#1 (P#8)
     L2 L#1 (1024KB) + L1 L#1 (16KB) + Core L#1
       PU L#2 (P#4)
       PU L#3 (P#12)
   Socket L#1 + L3 L#1 (4096KB)
     L2 L#2 (1024KB) + L1 L#2 (16KB) + Core L#2
       PU L#4 (P#1)
       PU L#5 (P#9)
     L2 L#3 (1024KB) + L1 L#3 (16KB) + Core L#3
       PU L#6 (P#5)
       PU L#7 (P#13)
   Socket L#2 + L3 L#2 (4096KB)
     L2 L#4 (1024KB) + L1 L#4 (16KB) + Core L#4
       PU L#8 (P#2)
       PU L#9 (P#10)
     L2 L#5 (1024KB) + L1 L#5 (16KB) + Core L#5
       PU L#10 (P#6)
       PU L#11 (P#14)
   Socket L#3 + L3 L#3 (4096KB)
     L2 L#6 (1024KB) + L1 L#6 (16KB) + Core L#6
       PU L#12 (P#3)
       PU L#13 (P#11)
     L2 L#7 (1024KB) + L1 L#7 (16KB) + Core L#7
       PU L#14 (P#7)
       PU L#15 (P#15)
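
Besides displaying the topology, hwloc can also bind processes by logical (L#) location, which is stable regardless of the OS's physical numbering. A sketch, assuming the hwloc command-line tools are installed:

```shell
#!/bin/sh
# Bind a command to the first PU (hardware thread) of physical core 0
# using hwloc-bind's hierarchical location syntax.
if command -v hwloc-bind >/dev/null 2>&1; then
    msg=$(hwloc-bind core:0.pu:0 -- echo "bound to core 0, PU 0")
else
    msg="hwloc-bind not installed (sudo apt-get install hwloc)"
fi
echo "$msg"
```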
+2

I tried to verify this information by comparing core temperatures against the load on the HT cores.

[screenshot: core temperatures compared against load on the hyper-threaded cores]

+1

A simple way to list the hyper-thread sibling groups in bash:

 cat $(find /sys/devices/system/cpu -regex ".*cpu[0-9]+/topology/thread_siblings_list") | sort -n | uniq 
0

Source: https://habr.com/ru/post/896433/

