Intel Auto-Vectorization Trip Count Explanation?

I have done quite a bit of thread-level and process-level parallelism, and now I am trying to move to instruction-level parallelism with the Intel C++ compiler, which is quite a challenge.

While doing some auto-vectorization of loops and analyzing the compiler logs, I found an "estimated maximum trip count" for a loop that I cannot understand.

Example:

double a[100], x[100], y[100];
...
for (int i = 0; i < 100; i++) {
   a[i] = x[i] + y[i];
}

This loop gives an estimated maximum trip count of 12 trips. I read somewhere that the vectorization process can handle 8 elements per trip as long as the cost of each loop iteration is less than 6 u-ops, and from what I can tell this simple loop has a cost of 1 store, 2 loads, and 1 arithmetic operation.

So in theory my trip count should be 100/8 = 12.5 trips, hence 13 trips.

Is this rounding down done by the compiler? Or is there some other optimization going on in the background that lets the loop finish in fewer than 13 trips?

One more question: is my assumption of 6 u-ops per iteration correct? Are there cases where it does not apply?

Thank you in advance


I can't tell you exactly how the Intel compiler arrives at its estimate, but let me look at the instruction-level parallelism available in this loop.

This loop is memory bandwidth bound. Here is the peak load/store bandwidth per clock cycle from Core2 through Broadwell:

Core2:   two 16 byte reads one 16 byte write per 2 clock cycles     -> 24 bytes/clock cycle
SB/IB:   two 32 byte reads and one 32 byte write per 2 clock cycles -> 48 bytes/clock cycle
HSW/BDW: two 32 byte reads and one 32 byte write per clock cycle    -> 96 bytes/clock cycle

The loop has to read and write sizeof(double)*100*3 = 2400 bytes in total, so the minimum number of clock cycles is:

Core2:   2400/24 = 100 clock cycles
SB/IB:   2400/48 =  50 clock cycles
HSW/BDW: 2400/96 =  25 clock cycles
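
To make that arithmetic explicit, here is a tiny sketch that reproduces those lower bounds (the bytes-per-clock figures are simply the table values above, not something queried from the hardware):

#include <cstdio>

int main() {
    const double bytes = sizeof(double) * 100 * 3;           // 2 loads + 1 store per element = 2400 bytes
    std::printf("Core2:   %.0f clock cycles\n", bytes / 24); // 24 bytes/clock
    std::printf("SB/IB:   %.0f clock cycles\n", bytes / 48); // 48 bytes/clock
    std::printf("HSW/BDW: %.0f clock cycles\n", bytes / 96); // 96 bytes/clock
}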

That is the lower bound imposed by memory bandwidth alone. The other limit is how many fused micro-ops the core can issue per clock cycle.

Here is the fused micro-op cost of one SIMD iteration of the loop. On Core2 each operation is its own fused micro-op; from Nehalem onward the scalar add and the conditional jump macro-fuse into a single micro-op:

                            Core2          Nehalem through Broadwell
vector add + load               1          1
vector load                     1          1
vector store                    1          1
scalar add                      1          ½
conditional jump                1          ½  
--------------------------------------------
total                           5          4

From Core2 through Ivy Bridge the limiting factor is the load/store ports, which can only service the two loads and one store of an iteration every two clock cycles. Haswell/Broadwell added a third address-generation unit on port 7, but it only handles simple addressing modes, which is what the stores to statically allocated arrays get here; with non-statically allocated arrays the store address competes with the loads, and an iteration takes about 1.5 clock cycles. So:

Core2:   5 fused micro-ops/every two clock cycles
SB/IB:   4 fused micro-ops/every two clock cycles
HSW/BDW: 4 fused micro-ops/every clock cycle for statically allocated arrays
HSW/BDW: 4 fused micro-ops/every 1.5 clock cycles for non-statically allocated arrays
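
To make the operation counts in the table concrete, here is a hand-written SSE2 version of the loop body (a sketch only; the function name is mine, and whether the second load is folded into the add as a memory operand is up to the compiler). Each iteration maps onto the five operations counted above: one load, one add (+ load), one store, one scalar index add, and one conditional jump.

#include <emmintrin.h>  // SSE2 intrinsics

void add_arrays_sse2(double* a, const double* x, const double* y) {
    for (int i = 0; i < 100; i += 2) {       // scalar add + conditional jump
        __m128d vx = _mm_loadu_pd(&x[i]);    // vector load
        __m128d vy = _mm_loadu_pd(&y[i]);    // this load can be folded into the add below
        __m128d vs = _mm_add_pd(vx, vy);     // vector add (+ load when folded)
        _mm_storeu_pd(&a[i], vs);            // vector store
    }
}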

The limits above are per iteration, so we also need the number of iterations. Without unrolling, the SIMD iteration counts are roughly:

SSE2: (100+1)/2 = 51
AVX:  (100+3)/4 = 26

However, the Intel compiler unrolls this loop by two, so the counts become:

SSE2: (100+3)/4 = 26
AVX:  (100+7)/8 = 13

Combining the iteration counts with the per-iteration throughput gives:

Core2:     51*2   = 102 clock cycles
SB/IB:     26*2   =  51 clock cycles
HSW/BDW:   26*1.5 =  39 clock cycles for non-statically allocated arrays no-unroll
HSW/BDW:   26*1   =  26 clock cycles for statically allocated arrays no-unroll
HSW/BDW:   26*1   =  26 clock cycles with full unrolling
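
If you want to compare these estimates with a real machine, here is a rough timing sketch. It assumes GCC/Clang/ICC-style __rdtsc and inline asm, and __rdtsc counts reference cycles rather than core clock cycles, so treat the numbers as indicative only:

#include <x86intrin.h>   // __rdtsc (GCC/Clang; ICC provides it as well)
#include <cstdio>

double a[100], x[100], y[100];

int main() {
    for (int i = 0; i < 100; ++i) { x[i] = i; y[i] = 2.0 * i; }

    unsigned long long best = ~0ULL;
    for (int rep = 0; rep < 100000; ++rep) {
        unsigned long long t0 = __rdtsc();
        asm volatile("" ::: "memory");       // keep the timed loop between the two rdtsc reads
        for (int i = 0; i < 100; ++i)
            a[i] = x[i] + y[i];
        asm volatile("" ::: "memory");
        unsigned long long t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;                  // best-of-many filters out interrupts and warm-up
    }
    std::printf("best: %llu reference cycles for 100 elements\n", best);
    return (int)a[50];                       // use the result so the loop cannot be optimized away
}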

I can't say whether your figure of 6 u-ops per iteration is right; that is an internal heuristic of the Intel vectorizer, and as far as I know it is not documented.

As for the 8 elements per trip: with AVX you get 4 doubles per 256-bit ymm register, so 8 elements per trip means the compiler processes two registers' worth of data each iteration.

The rounding goes down, not up: with 8 elements per trip the vectorized loop runs 12 times, not 13, and the leftover elements (fewer than 8) are handled by a separate remainder loop.

Something like this:

int i = 0;
for (; i < (100 & ~7); i += 8) {   // 12 iterations
    // Do vector code
}
for (; i < 100; ++i) {
    // Process loop remainder using scalar code
}
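
Under the AVX-with-unroll-by-two reading from the first answer, the 12 trips of 8 elements plus the 4-element remainder could look roughly like this with intrinsics (a sketch of the idea with a name of my choosing, not what the compiler actually generates):

#include <immintrin.h>  // AVX intrinsics

void add_arrays_avx(double* a, const double* x, const double* y, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {                              // n = 100 -> 12 trips of 8 elements
        __m256d s0 = _mm256_add_pd(_mm256_loadu_pd(&x[i]),
                                   _mm256_loadu_pd(&y[i]));       // first ymm: 4 doubles
        __m256d s1 = _mm256_add_pd(_mm256_loadu_pd(&x[i + 4]),
                                   _mm256_loadu_pd(&y[i + 4]));   // second ymm: the unroll by 2
        _mm256_storeu_pd(&a[i],     s0);
        _mm256_storeu_pd(&a[i + 4], s1);
    }
    for (; i < n; ++i)                                        // for n = 100, 4 elements remain
        a[i] = x[i] + y[i];                                   // scalar remainder loop
}

With n = 100 the main loop runs exactly 12 times, which is the trip count the compiler reports; the last 4 elements never enter the vector loop.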

Source: https://habr.com/ru/post/1612918/

