Why is my processor suddenly running twice as fast?

I am using a simple profiler to measure the performance of some C code on a school server, and I have run into an odd situation. After a short time (about half a second), the processor suddenly starts executing instructions twice as fast. I have checked every cause I could think of (caching, load balancing across cores, the processor frequency changing as it comes out of sleep), but everything looks normal.

For what it's worth, I'm running these tests on a school Linux server, so there may be an unusual configuration that I don't know about, but the processor ID in use does not change and (as far as I could tell) the server was completely idle while I was testing.

Test code:

    #include <time.h>
    #include <stdio.h>

    #define MY_CLOCK CLOCK_MONOTONIC_RAW
    // no difference if set to CLOCK_THREAD_CPUTIME_ID

    typedef struct {
        unsigned int tsc;
        unsigned int proc;
    } ans_t;

    static ans_t rdtscp(void){
        ans_t ans;
        __asm__ __volatile__ ("rdtscp" : "=a"(ans.tsc), "=c"(ans.proc) : : "edx");
        return ans;
    }

    static void nop(void){
        __asm__ __volatile__ ("");
    }

    void test(){
        for(int i=0; i<100000000; i++)
            nop();
    }

    int main(){
        int c=10;
        while(c-->0){
            struct timespec tstart,tend;
            ans_t start = rdtscp();
            clock_gettime(MY_CLOCK,&tstart);
            test();
            ans_t end = rdtscp();
            clock_gettime(MY_CLOCK,&tend);
            unsigned int tdiff = (tend.tv_sec-tstart.tv_sec)*1000000000+tend.tv_nsec-tstart.tv_nsec;
            unsigned int cdiff = end.tsc-start.tsc;
            printf("%u cycles and %u ns (%lf GHz) start proc %u end proc %u\n",cdiff,tdiff,(double)cdiff/tdiff,start.proc,end.proc);
        }
    }

The output I see:

    351038093 cycles and 125680883 ns (2.793091 GHz) start proc 14 end proc 14
    350911246 cycles and 125639359 ns (2.793004 GHz) start proc 14 end proc 14
    350959546 cycles and 125656776 ns (2.793001 GHz) start proc 14 end proc 14
    351533280 cycles and 125862608 ns (2.792992 GHz) start proc 14 end proc 14
    350903833 cycles and 125636787 ns (2.793002 GHz) start proc 14 end proc 14
    350924336 cycles and 125644157 ns (2.793002 GHz) start proc 14 end proc 14
    349827908 cycles and 125251782 ns (2.792997 GHz) start proc 14 end proc 14
    175289886 cycles and 62760404 ns (2.793001 GHz) start proc 14 end proc 14
    175283424 cycles and 62758093 ns (2.793001 GHz) start proc 14 end proc 14
    175267026 cycles and 62752232 ns (2.793001 GHz) start proc 14 end proc 14

I see similar behavior (after a different number of test runs, the speed doubles) at different optimization levels (from -O0 to -O3).

Could this be due to hyper-threading, where the two logical cores of a physical core (the server uses Xeon X5560s, which may allow this) somehow "merge" into a single double-speed processor?

+6
3 answers

Some systems scale processor speed based on system load. As you rightly notice, this is especially annoying when benchmarking.

If your server is running Linux, enter

 cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 

If this prints ondemand, powersave or userspace, then CPU frequency scaling is active, and it will be very hard to get reliable benchmark results. If it prints performance, then CPU frequency scaling is disabled.
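The same check can be made from inside the benchmark itself. The following is only a minimal sketch: the helper names `read_first_line` and `governor_is_performance` are made up for illustration, and it assumes the sysfs path shown above exists on the machine (on systems without cpufreq support the file is simply missing, which the code reports as -1).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a small text file into buf, without the
 * trailing newline. Returns 0 on success, -1 if it cannot be read. */
int read_first_line(const char *path, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';  /* strip the newline, if any */
    return 0;
}

/* Returns 1 if the governor of the given CPU is "performance",
 * 0 if it is something else, -1 if cpufreq information is unavailable. */
int governor_is_performance(int cpu)
{
    char path[128];
    char gov[64];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
    if (read_first_line(path, gov, sizeof gov) != 0)
        return -1;
    return strcmp(gov, "performance") == 0;
}
```

Calling `governor_is_performance(14)` before the timing loop (14 being the processor reported in the question's output) would let the program warn when the result is not 1.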

+1

Some processors have on-chip optimizations that learn the path your code usually takes (branch prediction). When the outcome of the next if statement is predicted correctly, the processor does not have to flush its pipeline and reload all the following operations from scratch. Depending on the chip and the algorithm, it can take 5 to 10 cycles before it predicts the if statements successfully. However, there are also arguments against this being the cause of the behavior you see.

Looking at your output, I would say it could simply be the system scheduler and/or the CPU frequency governor. Are you sure the processor frequency does not change while your code is running? No CPU boost? On Linux, tools such as cpufreq are often used to control the processor frequency.
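One detail worth checking here: the cycles/ns ratio in the question's output stays at about 2.793 in every row, including the fast ones. That is what one would expect if the time-stamp counter ticks at a fixed rate regardless of the current core clock (an assumption in this sketch, though it is how a constant or invariant TSC behaves). In that case the "GHz" column cannot reveal a frequency change, and the time per loop iteration is the more telling number. The arithmetic below uses two rows of the posted output; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define ITERATIONS 100000000.0  /* loop count of test() in the question */

/* Average wall-clock time of one loop iteration, in nanoseconds. */
double ns_per_iteration(double elapsed_ns)
{
    assert(elapsed_ns >= 0.0);
    return elapsed_ns / ITERATIONS;
}

/* TSC ticks per nanosecond, i.e. the "GHz" column of the output. */
double ticks_per_ns(double ticks, double elapsed_ns)
{
    assert(elapsed_ns > 0.0);
    return ticks / elapsed_ns;
}

/* Print both figures for one row of the question's output. */
void print_row(const char *label, double ticks, double elapsed_ns)
{
    printf("%s: %.4f ns per iteration, %.6f ticks per ns\n",
           label, ns_per_iteration(elapsed_ns),
           ticks_per_ns(ticks, elapsed_ns));
}

/* The first (slow) and eighth (fast) rows of the posted output. */
void print_question_rows(void)
{
    print_row("slow", 351038093.0, 125680883.0);
    print_row("fast", 175289886.0, 62760404.0);
}
```

The slow row works out to roughly 1.26 ns per iteration and the fast row to roughly 0.63 ns, while the ticks-per-ns figure is the same for both. So the wall-clock time per iteration really does halve, and the constant ratio alone does not rule out a change in core frequency.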

0

Hyper-threading duplicates the register state, not the actual decode/execution units, so that is not the explanation.

To check the accuracy of the microbenchmarking method, I would do the following:

  • Run the program at high priority
  • Count the number of instructions to make sure it is what you expect. I would do this with perf stat ./binary, which means you need to have perf installed. I would run it several times and look at the cycle and instruction counts to see how many instructions are executed per cycle.
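The first step can also be done from within the program. This is a minimal sketch using the POSIX `getpriority` and `setpriority` calls; the helper names `current_nice` and `raise_priority` are made up for illustration, and lowering the nice value below its current level normally requires privileges that a student account on a shared server may not have.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>

/* Current nice value of the calling process (lower means higher priority).
 * getpriority() can legitimately return -1, so errno is used to tell a
 * real error apart; *ok is set to 1 on success and 0 on failure. */
int current_nice(int *ok)
{
    errno = 0;
    int n = getpriority(PRIO_PROCESS, 0);
    *ok = !(n == -1 && errno != 0);
    return n;
}

/* Try to lower the nice value to target. Returns 0 on success, or if the
 * process is already at or below target; returns -1 if the kernel refuses,
 * which is the usual outcome without sufficient privileges. */
int raise_priority(int target)
{
    assert(target >= -20 && target <= 19);
    int ok;
    int now = current_nice(&ok);
    if (ok && now <= target)
        return 0;  /* nothing to do */
    return setpriority(PRIO_PROCESS, 0, target) == 0 ? 0 : -1;
}
```

Calling `raise_priority(-5)` at the start of `main` and checking the return value shows whether the request was honored. From the shell, starting the program under `nice` with a negative adjustment has a similar effect and needs the same privileges.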

I have a few additional comments:

For each nop, the loop also executes the compare and conditional branch of the for loop. If you really want to measure NOPs, I would write the code like this:

    #define NOP5 __asm__ __volatile__ ("nop; nop; nop; nop; nop");
    #define NOP25 NOP5 NOP5 NOP5 NOP5 NOP5
    #define NOP100 NOP25 NOP25 NOP25 NOP25
    #define NOP500 NOP100 NOP100 NOP100 NOP100 NOP100
    ...
    for(int i=0; i<100000000; i++) {
        NOP500 NOP500 NOP500 NOP500
    }

With this construction you mostly measure the NOPs themselves rather than the comparison of i with 100M.

-1

Source: https://habr.com/ru/post/978528/