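For Intel processors, how fast this loop can run in theory is bounded by the memory bandwidth, by the number of fused micro-ops per iteration, and by the available parallelism.
Start with bandwidth. The peak L1 load/store throughput for Core2 through Broadwell is: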
Core2: two 16 byte reads and one 16 byte write per 2 clock cycles -> 24 bytes/clock cycle
SB/IB: two 32 byte reads and one 32 byte write per 2 clock cycles -> 48 bytes/clock cycle
HSW/BDW: two 32 byte reads and one 32 byte write per clock cycle -> 96 bytes/clock cycle
sizeof(double)*100*3=2400
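So the loop moves 2400 bytes in total (two arrays read, one written), and bandwidth alone gives a lower bound of: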
Core2: 2400/24 = 100 clock cycles
SB/IB: 2400/48 = 50 clock cycles
HSW/BDW: 2400/96 = 25 clock cycles
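The bandwidth bound above can be reproduced with a few lines of arithmetic. This is only a sketch: the names `PEAK_L1` and `bandwidth_bound_cycles` are made up for illustration, and the port widths are simply the figures quoted above.

```python
# Peak L1 throughput per microarchitecture, from the figures quoted above:
# (reads per period, writes per period, bytes per access, clock cycles per period)
PEAK_L1 = {
    "Core2":   (2, 1, 16, 2),
    "SB/IB":   (2, 1, 32, 2),
    "HSW/BDW": (2, 1, 32, 1),
}

def bytes_per_cycle(reads, writes, width, cycles):
    """Bytes moved per clock cycle at peak load/store throughput."""
    return (reads + writes) * width / cycles

def bandwidth_bound_cycles(n_elements, element_size=8, n_arrays=3):
    """Lower bound on clock cycles if the loop is limited only by L1 bandwidth."""
    total_bytes = n_elements * element_size * n_arrays  # sizeof(double)*100*3
    return {cpu: total_bytes / bytes_per_cycle(*spec)
            for cpu, spec in PEAK_L1.items()}

if __name__ == "__main__":
    for cpu, cycles in bandwidth_bound_cycles(100).items():
        print(f"{cpu}: {cycles:g} clock cycles")
```

Changing `n_elements` or `element_size` shows how the bound scales for other array sizes or for float instead of double.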
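Bandwidth is not the only limit, though. The other one is the number of fused micro-ops per iteration. From Nehalem on, the scalar add and the conditional jump are counted as a single fused micro-op (½ each in the table):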
                     Core2   Nehalem through Broadwell
vector add + load      1       1
vector load            1       1
vector store           1       1
scalar add             1       ½
conditional jump       1       ½
--------------------------------------------
total                  5       4
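As a quick check of the totals, the table can be written down as data and summed. This is a sketch only. The name `UOPS` and the use of `Fraction` for the ½ entries are choices made for this illustration.

```python
from fractions import Fraction

HALF = Fraction(1, 2)  # one half of a fused add + jump pair

# (instruction, Core2, Nehalem through Broadwell), copied from the table above
UOPS = [
    ("vector add + load", 1, 1),
    ("vector load",       1, 1),
    ("vector store",      1, 1),
    ("scalar add",        1, HALF),
    ("conditional jump",  1, HALF),
]

def totals(rows):
    """Return the fused micro-op count per iteration for both columns."""
    core2 = sum(row[1] for row in rows)
    nehalem_plus = sum(row[2] for row in rows)
    return core2, nehalem_plus

if __name__ == "__main__":
    core2, nehalem_plus = totals(UOPS)
    print(f"Core2: {core2}, Nehalem through Broadwell: {nehalem_plus}")
```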
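From Core2 through Ivy Bridge the loop is limited to one iteration per two clock cycles by the load/store throughput, so the micro-op count is not the bottleneck there. On Haswell/Broadwell the store can use port 7, but only with a 32-bit absolute address + offset (which is not possible, for example, on OSX). So on Haswell/Broadwell, unless the arrays are statically allocated, one iteration takes 1.5 clock cycles. That gives: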
Core2: 5 fused micro-ops/every two clock cycles
SB/IB: 4 fused micro-ops/every two clock cycles
HSW/BDW: 4 fused micro-ops/every clock cycle for statically allocated arrays
HSW/BDW: 4 fused micro-ops/every 1.5 clock cycles for non-statically allocated arrays
The number of iterations depends on the SIMD width. For an array of 100 doubles:
SSE2: (100+1)/2 = 51
AVX: (100+3)/4 = 26
With four elements handled per SSE2 iteration and eight per AVX iteration, the counts are:
SSE2: (100+3)/4 = 26
AVX: (100+7)/8 = 13
Multiplying the iteration count by the clock cycles per iteration gives, for doubles:
Core2: 51*2 = 102 clock cycles
SB/IB: 26*2 = 52 clock cycles
HSW/BDW: 26*1.5 = 39 clock cycles for non-statically allocated arrays no-unroll
HSW/BDW: 26*1 = 26 clock cycles for statically allocated arrays no-unroll
HSW/BDW: 26*1 = 26 clock cycles with full unrolling
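The final estimates are just the iteration count times the clock cycles per iteration. The sketch below makes that explicit. The iteration counts and cycles per iteration are taken as stated in the text rather than derived, and the names `ITERATIONS`, `CASES` and `estimate_cycles` are hypothetical.

```python
# Loop iterations for 100 doubles, as stated in the text.
ITERATIONS = {"SSE2": 51, "AVX": 26}

# (label, instruction set, clock cycles per iteration), as stated in the text.
CASES = [
    ("Core2",                                 "SSE2", 2),
    ("SB/IB",                                 "AVX",  2),
    ("HSW/BDW, non-static arrays, no unroll", "AVX",  1.5),
    ("HSW/BDW, static arrays, no unroll",     "AVX",  1),
    ("HSW/BDW, full unrolling",               "AVX",  1),
]

def estimate_cycles(cases=CASES, iterations=ITERATIONS):
    """Return (label, estimated clock cycles) for each case."""
    return [(label, iterations[isa] * cpi) for label, isa, cpi in cases]

if __name__ == "__main__":
    for label, cycles in estimate_cycles():
        print(f"{label}: {cycles:g} clock cycles")
```

These are theoretical estimates from the micro-op throughput alone. Measured timings can differ.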