Slow instructions in a simple x86 loop

I wrote a simple loop in C++ to profile the performance of the multiply instruction on my processor. When I profiled it, I found some interesting nuances in the generated assembly code.

Here is the C++ program:

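#include <cstdint>

#define TESTS 10000000
#define BUFSIZE 1000

int main() {
    uint32_t buf_in1[BUFSIZE];
    uint32_t buf_in2[BUFSIZE];
    uint32_t volatile buf_out[BUFSIZE];  // volatile: every store must actually be performed

    unsigned int i, j;

    // Fill the inputs so every multiply has defined operands.
    for (i = 0; i < BUFSIZE; i++) {
        buf_in1[i] = i;
        buf_in2[i] = i;
    }

    // The loop being profiled: TESTS passes over the buffers.
    for (j = 0; j < TESTS; j++) {
        for (i = 0; i < BUFSIZE; i++) {
            buf_out[i] = buf_in1[i] * buf_in2[i];
        }
    }

    return 0;
}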

I compiled with the following flags:

[Screenshot: Optimization settings]

[Screenshot: Code Generation settings]

I compiled it in Visual Studio 2012 as a Win32 build, although I run it on a 64-bit machine.

Note the volatile qualifier on buf_out. It is only there to stop the compiler from optimizing the loop away.

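I profiled the program (with AMD CodeXL). About 30% of the time is attributed to imul and about 60% elsewhere in the loop:

[Screenshot: profiler output]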

, Timer , . 1 , 2609 2609 , .

, , , - mov, jb (, ).

mov,

mov [esp+eax+00001f40h],ecx

This stores the result (ecx) into buf_out at the offset held in eax (that is, i). Why would this mov be slow? Compare it with this one:

mov ecx,[esp+eax+00000fa0h]

After all, 1000 uint32_t values take 4000 bytes, and 4000 * 3 = 12000 bytes, roughly 12 KB. My L1 cache is 64 KB, so everything should fit in L1...
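The working-set arithmetic can be checked at compile time. This is a minimal sketch, not part of the original program: the names kBufSize, kNumBuffers, kAssumedL1Bytes, working_set_bytes and fits_in_cache are made up for illustration, and the 64 KB figure is the L1 size quoted in the question, taken here as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBufSize = 1000;              // BUFSIZE from the question
constexpr std::size_t kNumBuffers = 3;              // buf_in1, buf_in2, buf_out
constexpr std::size_t kAssumedL1Bytes = 64 * 1024;  // assumption: 64 KB L1 data cache

// Total number of bytes the inner loop touches across all three buffers.
constexpr std::size_t working_set_bytes() {
    return kNumBuffers * kBufSize * sizeof(std::uint32_t);
}

// True if the whole working set is no larger than the given cache size.
constexpr bool fits_in_cache(std::size_t cache_bytes) {
    return working_set_bytes() <= cache_bytes;
}

static_assert(working_set_bytes() == 12000, "3 buffers * 1000 elements * 4 bytes");
static_assert(fits_in_cache(kAssumedL1Bytes), "12000 bytes is below the assumed L1 size");
```

Fitting by size alone says nothing about associativity or about what else shares the cache, so this only confirms the capacity estimate.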

Here is the Coreinfo output:

[Screenshot: Coreinfo output]

:

jb $-1ah (0x903732)

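This jump takes 33% of the time. A cache line is 64 bytes, and the jump distance 0x1A is 26 bytes. Does that mean the jump crosses a 64-byte boundary? (0x903740 is a 64-byte boundary.)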
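The boundary question for the jb instruction comes down to address arithmetic. The sketch below assumes that 0x903732 is the jump target (the top of the loop) and that the jb itself sits 0x1A bytes later; if 0x903732 is instead the address of the jb, the numbers change. The helper names line_index and spans_boundary are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kLineBytes = 64;  // boundary size discussed in the question

// Index of the 64-byte block that contains the given address.
constexpr std::uint32_t line_index(std::uint32_t addr) {
    return addr / kLineBytes;
}

// True if the two addresses fall into different 64-byte blocks.
constexpr bool spans_boundary(std::uint32_t first, std::uint32_t last) {
    return line_index(first) != line_index(last);
}

constexpr std::uint32_t kLoopStart = 0x903732;          // address shown next to the jb
constexpr std::uint32_t kJumpAddr = kLoopStart + 0x1A;  // assumption: jb is 0x1A bytes later

static_assert(kJumpAddr == 0x90374C, "0x903732 + 0x1A");
static_assert(0x903740 % kLineBytes == 0, "0x903740 is 64-byte aligned");
static_assert(spans_boundary(kLoopStart, kJumpAddr), "the loop body straddles 0x903740");
```

Under that assumption the loop body does straddle 0x903740. Whether that costs anything depends on the processor's fetch and decode behavior, which this arithmetic does not address.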

- ?

.

+4
3

Mystical, , , , .

(imul add 4 eax , , mov, ALU imul ).

, , - , , , - , .

, , . .

, CPU , , .

+2

, L1, , , - , CPU ( , ).

, : . , , . .

, .

+1

Unfortunately, you did not say how long one iteration of your loop takes, but I assume it is three clock cycles. If so, the three instructions that have time attributed to them are the ones the processor is "officially" executing when the clock ticks. The other three instructions execute in parallel with them, hidden behind the three that get the time.
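The time per iteration that this answer asks about can be estimated with a wall-clock measurement. This is a rough sketch, not the asker's code: time_multiply_loop, LoopTiming and estimated_cycles are hypothetical names, the buffers are static rather than on the stack, and the clock frequency passed to estimated_cycles is something you have to supply yourself (it is an assumption, and turbo or power-saving modes make it approximate).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBufSize = 1000;  // BUFSIZE from the question

struct LoopTiming {
    double ns_per_iteration;   // average wall-clock time of one inner-loop iteration
    std::uint32_t last_value;  // last element written, read back as a sanity check
};

// Runs the multiply loop from the question `passes` times and reports the
// average time per inner iteration.
LoopTiming time_multiply_loop(std::size_t passes) {
    assert(passes > 0);
    static std::uint32_t in1[kBufSize];
    static std::uint32_t in2[kBufSize];
    static volatile std::uint32_t out[kBufSize];  // volatile keeps every store

    for (std::size_t i = 0; i < kBufSize; ++i) {
        in1[i] = static_cast<std::uint32_t>(i);
        in2[i] = static_cast<std::uint32_t>(i);
    }

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t j = 0; j < passes; ++j) {
        for (std::size_t i = 0; i < kBufSize; ++i) {
            out[i] = in1[i] * in2[i];
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    const double ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return {ns / static_cast<double>(passes * kBufSize), out[kBufSize - 1]};
}

// Converts nanoseconds to clock cycles for an assumed clock frequency in GHz.
double estimated_cycles(double ns_per_iteration, double clock_ghz) {
    return ns_per_iteration * clock_ghz;
}
```

For example, calling time_multiply_loop(10000) and passing the result to estimated_cycles with your nominal clock speed gives a figure to compare against the three-cycle guess above.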


Source: https://habr.com/ru/post/1539144/

