Is your application structured in such a way that you can profitably vectorize this function by computing several independent values โโin parallel? This will allow you to use hardware vector units such as SSE.
It also seems like you're better off not supporting 1/r values, rather than r .
This is an example that explicitly uses SSE2 instructions to implement the function. ljd() calculates two values โโat once.
static __m128d ljd(__m128d r) { static const __m128d two = { 2.0, 2.0 }; static const __m128d si = { Si, Si }; static const __m128d ep24 = { 24 * Ep, 24 * Ep }; __m128d temp2, temp3; __m128d temp = _mm_div_pd(si, r); __m128d ep24_r = _mm_div_pd(ep24, r); temp = _mm_mul_pd(temp, temp); temp2 = _mm_mul_pd(temp, temp); temp2 = _mm_mul_pd(temp2, temp); temp3 = _mm_mul_pd(temp2, temp2); temp3 = _mm_mul_pd(temp3, two); return _mm_mul_pd(ep24_r, _mm_sub_pd(temp2, temp3)); } void ljd_array(double out[], const double in[], int n) { int i; for (i = 0; i < n; i += 2) { _mm_store_pd(out + i, ljd(_mm_load_pd(in + i))); } }
However, it is important to note that the latest versions of GCC can often automatically perform functions such as this automatically if you aim for the right architecture and turn on optimization. If you are targeting 32-bit x86, try compiling with -msse2 -O3 and adjust the settings so that the input and output arrays are aligned by 16 bytes.
Alignment for static and automatic arrays can be achieved with gcc with the attribute type __attribute__ ((aligned (16))) , and for dynamic arrays using the posix_memalign() function.
source share