I tried to port code from the FANN library (a neural network library written in C) to SSE2, but the SSE2 version performs worse than the regular code. With my SSE2 implementation one run takes 5.50 minutes, versus 5.20 minutes without it.
How can SSE2 be slower than the plain scalar code? Could it be because of _mm_set_ps? I compile with the Apple LLVM compiler (Xcode 4), with all SSE extension flags enabled and the optimization level set to -Os.
Code without SSE2
neuron_sum += fann_mult(weights[i], neurons[i].value) + fann_mult(weights[i + 1], neurons[i + 1].value) + fann_mult(weights[i + 2], neurons[i + 2].value) + fann_mult(weights[i + 3], neurons[i + 3].value);
SSE2 Code
__m128 a_line = _mm_loadu_ps(&weights[i]);
__m128 b_line = _mm_set_ps(neurons[i+3].value, neurons[i+2].value, neurons[i+1].value, neurons[i].value);
__m128 c_line = _mm_mul_ps(a_line, b_line);
neuron_sum += c_line[0] + c_line[1] + c_line[2] + c_line[3];
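For anyone who wants to reproduce the comparison, here is a minimal self-contained sketch of both variants. Several things in it are assumptions made for illustration and are not taken from FANN itself: the reduced `struct neuron` layout, the function names `weighted_sum_scalar` and `weighted_sum_sse`, and the definition of `fann_mult` as a plain multiplication. It also reads the lanes back with `_mm_storeu_ps` rather than indexing the `__m128` directly, since direct indexing relies on a compiler extension.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h> /* SSE/SSE2 intrinsics */

/* Hypothetical stand-in for the neuron struct: only the field used here
 * matters; the extra member just makes the values non-contiguous. */
struct neuron {
    float sum;
    float value;
};

/* Assumption: for float networks fann_mult is a plain multiplication. */
#define fann_mult(x, y) ((x) * (y))

/* Scalar variant, unrolled by four like the loop body in the question.
 * n must be a multiple of 4. */
float weighted_sum_scalar(const float *weights, const struct neuron *neurons, size_t n)
{
    float neuron_sum = 0.0f;
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        neuron_sum += fann_mult(weights[i],     neurons[i].value)
                    + fann_mult(weights[i + 1], neurons[i + 1].value)
                    + fann_mult(weights[i + 2], neurons[i + 2].value)
                    + fann_mult(weights[i + 3], neurons[i + 3].value);
    }
    return neuron_sum;
}

/* SSE variant mirroring the question: one vector multiply per four
 * elements, followed by a horizontal add on every iteration. */
float weighted_sum_sse(const float *weights, const struct neuron *neurons, size_t n)
{
    float neuron_sum = 0.0f;
    float lanes[4];
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 a_line = _mm_loadu_ps(&weights[i]);
        /* _mm_set_ps takes its arguments from the highest lane to the lowest. */
        __m128 b_line = _mm_set_ps(neurons[i + 3].value, neurons[i + 2].value,
                                   neurons[i + 1].value, neurons[i].value);
        __m128 c_line = _mm_mul_ps(a_line, b_line);
        _mm_storeu_ps(lanes, c_line); /* copy the four products back to memory */
        neuron_sum += lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }
    return neuron_sum;
}
```

Note that in this layout the `value` fields are not adjacent in memory, so `_mm_set_ps` has to assemble the vector from four separate scalar loads, while the weights are read with a single unaligned load.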