Why do I see 400x outlier timings when calling clock_gettime repeatedly?

I'm trying to measure the execution time of some C++ statements using the physical clock, but I have run into a problem: reading the measurement off the physical clock on the computer can itself take a long time. Here is the code:

#include <string>
#include <cstdlib>
#include <iostream>
#include <math.h>
#include <time.h>

int main()
{
    int64_t mtime, mtime2, m_TSsum, m_TSssum, m_TSnum, m_TSmax;
    struct timespec t0;
    struct timespec t1;
    int i, j;
    for (j = 0; j < 10; j++) {
        m_TSnum = 0; m_TSsum = 0; m_TSssum = 0; m_TSmax = 0;
        for (i = 0; i < 10000000; i++) {
            clock_gettime(CLOCK_REALTIME, &t0);
            clock_gettime(CLOCK_REALTIME, &t1);
            mtime  = (t0.tv_sec * 1000000000LL + t0.tv_nsec);
            mtime2 = (t1.tv_sec * 1000000000LL + t1.tv_nsec);
            m_TSsum  += (mtime2 - mtime);
            m_TSssum += (mtime2 - mtime) * (mtime2 - mtime);
            if ((mtime2 - mtime) > m_TSmax) { m_TSmax = (mtime2 - mtime); }
            m_TSnum++;
        }
        std::cout << "Average " << (double)(m_TSsum) / m_TSnum
                  << " +/- " << floor(sqrt((m_TSssum/m_TSnum - (m_TSsum/m_TSnum) * (m_TSsum/m_TSnum))))
                  << " (" << m_TSmax << ")" << std::endl;
    }
}

I then run it on a dedicated core (or so the sysadmin tells me), to avoid any problems with the scheduler moving the process to the background:

 $ taskset -c 20 ./a.out 

and I get:

 Average 18.0864 +/- 10 (17821)
 Average 18.0807 +/- 8 (9116)
 Average 18.0802 +/- 8 (8107)
 Average 18.078 +/- 6 (7135)
 Average 18.0834 +/- 9 (21240)
 Average 18.0827 +/- 8 (7900)
 Average 18.0822 +/- 8 (9079)
 Average 18.086 +/- 8 (8840)
 Average 18.0771 +/- 6 (5992)
 Average 18.0894 +/- 10 (15625)

So clearly calling clock_gettime() takes about 18 nanoseconds (on this particular server), but what I can't figure out is why the "max" time is 300 to 1000 times longer.

If we assume that the core is truly dedicated to this process and not used by anything else (which may or may not be true; when running on a non-dedicated core the average time is the same, but the sd/max are somewhat bigger), what else could cause these "slowdowns" (for lack of a better name)?

3 answers

Why Outliers?

There are many software and hardware reasons why you might see outlier events when you repeat two back-to-back clock_gettime calls 10 million times:

  • Context switches: even pinned to a CPU, the OS may periodically decide to run something else on your logical CPU.
  • SMT: assuming this is on a CPU with SMT (e.g., hyperthreading on x86), the scheduler will probably periodically schedule something on the sibling core (the same physical core as your process). This can dramatically affect the overall performance of your code, since two threads compete for the same core resources. Furthermore, there is probably a transition period between SMT and non-SMT execution during which nothing executes, since the core has to re-partition some resources when SMT execution begins.
  • Interrupts: a typical system will receive hundreds of interrupts per second from the network card, graphics devices, hardware clocks, system timers, audio devices, I/O devices, cross-CPU IPIs, and so on. Try watch -n1 cat /proc/interrupts and see how much activity is happening on what you might think is an otherwise idle system.
  • Hardware pauses: the CPU itself may periodically stop executing instructions for a variety of reasons, such as power or thermal throttling, or simply because the CPU is undergoing a frequency transition.
  • System management mode: entirely apart from interrupts seen and handled by the OS, x86 CPUs have a type of "hidden interrupt" that allows SMM functionality to execute on your CPU, with the only apparent effect being periodic unexpected jumps in the cycle counters used to measure real time.
  • Normal performance variations: your code won't execute exactly the same way every time. Initial iterations will suffer data and instruction cache misses and have untrained predictors for things like branch direction. Even in an apparent "steady state" you may still suffer performance variations from things beyond your control.
  • Different code paths: you might expect your loop to execute exactly the same instructions every time through 1: after all, nothing is really changing, right? Well, if you dig into the internals of clock_gettime, you may very well find branches that take a different path when some overflow occurs, or when a read of the adjustment factors in the VDSO races with an update, etc.

This isn't even a comprehensive list, but it should at least give you a taste of the factors that can cause outliers. You can eliminate or reduce the effect of some of them, but complete control is generally impossible on a modern x86 OS without hard real-time support 2.

My guess

If I had to guess, based on your typical outlier of ~8000 ns, which is probably too small to be a context-switch interruption, you are most likely seeing the effect of processor frequency scaling due to variable TurboBoost ratios. That's a mouthful, but basically modern x86 chips run at different "max turbo" speeds depending on how many cores are active. For example, my i7-6700HQ will run at 3.5 GHz if one core is active, but only 3.3, 3.2, or 3.1 GHz if 2, 3, or 4 cores are active, respectively.

This means that even if your process is never interrupted, any work at all that runs even briefly on another CPU may cause a frequency transition (e.g., because you transition from 1 to 2 active cores), and during such a transition the CPU sits idle for thousands of cycles while voltages stabilize. You can find detailed numbers and tests in this answer, but the upshot is that on the tested CPU the stabilization takes roughly 20,000 cycles, very much in line with your observed outliers of ~8000 nanoseconds. Sometimes you might get two transitions in a period, which doubles the impact, and so on.
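One hedged way to probe this guess (my own sketch, not from the answer above) is to watch the frequency the kernel reports for your benchmark core while the test runs. The sysfs path below assumes a cpufreq driver is loaded, and the CPU number is only illustrative, matching the taskset -c 20 above:

#include <fstream>
#include <iostream>
#include <string>

// Reads the kernel-reported frequency (in kHz) of the given CPU, or -1 on failure.
long read_cpu_khz(int cpu) {
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                    "/cpufreq/scaling_cur_freq");
    long khz = -1;
    f >> khz;
    return khz;
}

int main() {
    // Sample the frequency of the benchmark core a few times; if the value
    // jumps around while the benchmark is running on that core, frequency
    // transitions are a plausible cause of the outliers.
    for (int i = 0; i < 10; i++) {
        std::cout << "cpu20: " << read_cpu_khz(20) << " kHz\n";
    }
}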

Narrowing it down

Get a distribution

If you still want to know the cause of your outliers, you can take the following steps and observe the effect on the outlier behavior.

First, you should collect more data. Rather than just recording the max over 10,000,000 iterations, you should collect a histogram with some reasonable bucket size (say 100 ns, or even better some type of geometric bucket size that gives higher resolution for shorter times). This will be a huge help, because you will be able to see exactly where the times are clustered: it is entirely possible that you have effects other than the 6000 - 17000 ns outliers you note with "max", and they may have different causes.

A histogram also lets you understand the outlier frequency, which you can correlate with the frequencies of things you can measure, to see if they line up.

Now, adding the histogram code potentially adds more variance to the timing loop, since (for example) you will be accessing different cache lines depending on the timing value, but this is manageable, especially because the recording of the time happens outside the "timed region".
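As a rough sketch of what that could look like (my own illustration, not code from the question), the loop can drop each delta into a power-of-two bucket, with the bookkeeping done after the second clock_gettime call so it stays outside the timed region:

#include <cstdint>
#include <cstdio>
#include <time.h>

// Bucket k counts deltas in [2^k, 2^(k+1)) nanoseconds; bucket 0 also catches
// anything below 2 ns. Geometric buckets give fine resolution for the common
// ~18 ns case and coarse resolution for the outliers.
static const int NBUCKETS = 32;
static uint64_t buckets[NBUCKETS];

static int bucket_for(int64_t delta) {
    int k = 0;
    while (delta > 1 && k < NBUCKETS - 1) { delta >>= 1; k++; }
    return k;
}

int main() {
    struct timespec t0, t1;
    for (int i = 0; i < 10000000; i++) {
        clock_gettime(CLOCK_REALTIME, &t0);
        clock_gettime(CLOCK_REALTIME, &t1);
        int64_t delta = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
                        (t1.tv_nsec - t0.tv_nsec);
        buckets[bucket_for(delta)]++;   // recorded outside the timed region
    }
    for (int k = 0; k < NBUCKETS; k++) {
        if (buckets[k]) {
            printf("[%10lld, %10lld) ns: %llu\n",
                   1LL << k, 1LL << (k + 1), (unsigned long long)buckets[k]);
        }
    }
}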

Troubleshooting specific issues

From there, you can try to systematically check the issues mentioned above to see if they are the cause. Here are some ideas:

  • Hyperthreading: just turn it off in the BIOS when running single-threaded benchmarks, which eliminates that whole class of problems in one move. In general, I have found that this also leads to a giant reduction in fine-grained benchmark variance, so it's a good first step.
  • Frequency scaling: on Linux, you can usually turn off sub-nominal frequency scaling by setting the governor to "performance". You can turn off super-nominal operation (aka turbo) by setting /sys/devices/system/cpu/intel_pstate/no_turbo to 1 if you're using the intel_pstate driver. You can also manipulate turbo mode directly via MSR if you have a different driver, or do it in the BIOS if all else fails. Anecdotally, the outliers mostly disappear when turbo is off, so that's the first thing to try.

    Assuming you really do want to keep using turbo in production, you can limit the max turbo ratio manually to one that applies to N cores (e.g., 2 cores), and then take the other CPUs offline so that at most that many cores are ever active. Then you can run at your new max turbo all the time, regardless of how many cores are active (although you may, of course, still be subject to power, current, or thermal limits in some cases).

  • Interrupts: you can search for "interrupt affinity" to try to move interrupts to/away from your pinned core and see the effect on the outlier distribution. You can also count the number of interrupts (e.g., via /proc/interrupts) and check whether the count is large enough to explain the number of outliers. If you find that timer interrupts specifically are the cause, you can explore the various "tickless" (aka "NOHZ") modes your kernel offers to reduce or eliminate them. You can also count them directly via the x86 hardware performance counter HW_INTERRUPTS.RECEIVED.
  • Context switches: you can use real-time priorities or isolcpus to prevent other processes from running on your CPU. Keep in mind that context-switch issues, while usually positioned as the main/only issue, are actually fairly rare: at most they usually occur at the HZ rate (often 250/second on modern kernels), and on a mostly idle system it will be rare for the scheduler to actually decide to schedule another process on your busy CPU. If you keep your benchmark loops short, you can almost entirely avoid context switches (a sketch of how to count them around a run follows this list).
  • Code-related performance variations: you can check whether this is happening with various profiling tools such as perf. You can carefully design the core of your packet-handling code to avoid outlier events like cache misses, e.g., by touching the relevant cache lines beforehand, and you can avoid the use of system calls of unknown complexity as much as possible.
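As a small sketch of the context-switch check mentioned in the list (my own illustration), getrusage() reports voluntary and involuntary context-switch counts for the calling process, so you can compare them against the number of outliers seen in the same run:

#include <cstdio>
#include <sys/resource.h>

int main() {
    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);

    // ... run the timing loop from the question here ...

    getrusage(RUSAGE_SELF, &after);
    // If these counts are far smaller than the number of outliers,
    // context switches alone cannot explain them.
    printf("voluntary ctx switches:   %ld\n", after.ru_nvcsw  - before.ru_nvcsw);
    printf("involuntary ctx switches: %ld\n", after.ru_nivcsw - before.ru_nivcsw);
}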

While some of the items above are purely for investigation, many of them will both help you determine what is causing the pauses and help you mitigate them.

I'm not aware of mitigations for every issue, though: for SMM, for example, you may need specialized hardware or BIOS support to avoid it.


1 Well, except perhaps in the case where the if( (mtime2-mtime)> m_TSmax ) condition is triggered - but this should be rare (and perhaps your compiler has made it branch-free, in which case there is only one execution path).

2 Actually, it's not clear you can get to "zero variance" even with a hard real-time OS: some x86-specific factors, such as SMM mode and DVFS-related stalls, seem unavoidable.


taskset sets the affinity of YOUR process, which means your process is restricted to run on the specified CPU cores. It does not restrict other processes in any way, which means any of them can preempt your process at any time (since all of them are allowed to run on the core you have chosen for your process). So your maximum time intervals (those 5-25 microseconds) may represent other processes or interrupt handlers running on your CPU, plus context-switch time. In addition, you are using CLOCK_REALTIME, which may be subject to NTP adjustments, etc. To measure time intervals you should use CLOCK_MONOTONIC (or the Linux-specific CLOCK_MONOTONIC_RAW).
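For illustration (a sketch, not part of the answer above), the measurement from the question only needs the clock id changed:

#include <cstdint>
#include <cstdio>
#include <time.h>

int main() {
    struct timespec t0, t1;
    // CLOCK_MONOTONIC is not stepped by NTP; CLOCK_MONOTONIC_RAW (Linux)
    // additionally avoids NTP rate adjustment, so either is better suited
    // to interval measurement than CLOCK_REALTIME.
    clock_gettime(CLOCK_MONOTONIC, &t0);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    int64_t delta = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
                    (t1.tv_nsec - t0.tv_nsec);
    printf("back-to-back clock_gettime: %lld ns\n", (long long)delta);
}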


It is much easier in modern C++:

 #include <chrono>

 auto start = std::chrono::steady_clock::now();
 .....
 auto stop = std::chrono::steady_clock::now();
 auto duration = stop - start;
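To print the elapsed time in nanoseconds, a duration_cast does the unit conversion; a minimal self-contained example (the timed work here is just a placeholder loop):

#include <chrono>
#include <iostream>

int main() {
    auto start = std::chrono::steady_clock::now();
    volatile long sink = 0;
    for (long i = 0; i < 1000000; i++) sink = sink + i;   // placeholder work
    auto stop = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    std::cout << ns.count() << " ns\n";
}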

18 nanoseconds is pretty fast for a non-real-time operating system. Do you really need to measure anything more precisely than that? By my math, 18 ns is only 72 clock cycles on a 4 GHz processor.


Source: https://habr.com/ru/post/1275812/

