I have a simple loop that I wrote in C++ because I wanted to profile the performance of the multiply instruction on my processor. When I profiled it, I found some interesting nuances in the generated assembly.
Here is the C++ program:
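#include <cstdint>

#define TESTS 10000000
#define BUFSIZE 1000

int main() {
    uint32_t buf_in1[BUFSIZE];
    uint32_t buf_in2[BUFSIZE];
    // volatile forces every store to buf_out to actually be emitted
    uint32_t volatile buf_out[BUFSIZE];
    unsigned int i, j;

    // fill both inputs so each multiply has defined operands
    for (i = 0; i < BUFSIZE; i++) {
        buf_in1[i] = i;
        buf_in2[i] = i;
    }

    // the loop being profiled: TESTS passes over BUFSIZE multiplies
    for (j = 0; j < TESTS; j++) {
        for (i = 0; i < BUFSIZE; i++) {
            buf_out[i] = buf_in1[i] * buf_in2[i];
        }
    }
    return 0;
}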
I compiled with the following flags:
Optimization:

Code Generation:

It was compiled in Visual Studio 2012 as a Win32 build, although I run it on a 64-bit machine.
Note the volatile qualifier on buf_out. It is there only to stop the compiler from optimizing the loop away.
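I ran the program under a profiler (AMD CodeXL), and the multiply itself accounts for a minority of the time. About 30% of the samples fall on imul and about 60% fall elsewhere in the loop: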

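In this view, the Timer column is the number of timer samples attributed to each instruction, so a value of 2609 means that 2609 samples landed on that instruction.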
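What stands out is that most of the time goes not to the multiply but to a mov and to the jb (the loop's conditional jump).
Starting with the mov: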
mov [esp+eax+00001f40h],ecx
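This stores the product (ecx) into buf_out at the offset held in eax (the register that tracks i). Why is this store so expensive when the other mov is not? That is, this one: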
mov ecx,[esp+eax+00000fa0h]
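As far as I can tell, 1000 uint32_t values take 4000 bytes, and 4000 * 3 = 12000 bytes, or roughly 12 KB. My L1 cache is 64 KB, so all three buffers should fit in L1...
For reference, here is my cache information from Coreinfo: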

Now for the jump:
jb $-1ah (0x903732)
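This instruction accounts for about 33% of the time. A cache line is 64 bytes, and the jump goes back 0x1A, which is 26 bytes. Is it slow because the jump crosses a 64-byte boundary? (0x903740 is a 64-byte boundary.)
Can anyone explain this?
Thanks.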