Segmentation fault when using SSE intrinsics due to incorrect memory alignment

Question

This is my first time working with SSE intrinsics, and I am getting a segmentation fault even after ensuring 16-byte memory alignment. This post is a continuation of my previous question:

How to allocate 16-byte memory-aligned data

This is how I declared my array:

float *V = (float*) memalign(16,dx*sizeof(float));

When I try to do this:

  __m128 v_i = _mm_load_ps(&V[i]); //It works

But when I do this:

  __m128 u1 = _mm_load_ps(&V[(i-1)]); //There is a segmentation fault

But if I do this:

  __m128 u1 = _mm_loadu_ps(&V[(i-1)]); //It works again

However, I want to avoid _mm_loadu_ps and use only _mm_load_ps.

I am using the Intel icc compiler.

How can I solve this problem?

UPDATE:

I am using both operations in the following code:

  void FDTD_base (float *V, float *U, int dx, float c0, float c1, float c2, float c3, float c4)
  {
      int i, j, k;
      for (i = 4; i < dx-4; i++)
      {
          U[i] = (c0 * (V[i]) //center
                + c1 * (V[(i-1)] + V[(i+1)] )
                + c2 * (V[(i-2)] + V[(i+2)] )
                + c3 * (V[(i-3)] + V[(i+3)] )
                + c4 * (V[(i-4)] + V[(i+4)] ));
      }
  }

SSE Version:

  for (i=4; i < dx-4; i+=4)
  {
      v_i = _mm_load_ps(&V[i]);
      __m128 center = _mm_mul_ps(v_i,c0_i);

      __m128 u1 = _mm_loadu_ps(&V[(i-1)]);
      u2 = _mm_loadu_ps(&V[(i+1)]);

      u3 = _mm_loadu_ps(&V[(i-2)]);
      u4 = _mm_loadu_ps(&V[(i+2)]);

      u5 = _mm_loadu_ps(&V[(i-3)]);
      u6 = _mm_loadu_ps(&V[(i+3)]);

      u7 = _mm_load_ps(&V[(i-4)]);
      u8 = _mm_load_ps(&V[(i+4)]);

      __m128 tmp1 = _mm_add_ps(u1,u2);
      __m128 tmp2 = _mm_add_ps(u3,u4);
      __m128 tmp3 = _mm_add_ps(u5,u6);
      __m128 tmp4 = _mm_add_ps(u7,u8);

      __m128 tmp5 = _mm_mul_ps(tmp1,c1_i);
      __m128 tmp6 = _mm_mul_ps(tmp2,c2_i);
      __m128 tmp7 = _mm_mul_ps(tmp3,c3_i);
      __m128 tmp8 = _mm_mul_ps(tmp4,c4_i);

      __m128 tmp9 = _mm_add_ps(tmp5,tmp6);
      __m128 tmp10 = _mm_add_ps(tmp7,tmp8);

      __m128 tmp11 = _mm_add_ps(tmp9,tmp10);
      __m128 tmp12 = _mm_add_ps(center,tmp11);

      _mm_store_ps(&U[i], tmp12);
  }

Is there a more efficient way to do this using only _mm_load_ps()?


c memory sse icc

PGOnTheGo Jun 18 '12 at 15:22


1 answer

Pedro · Accepted Answer · 2012-06-18T15:29:34+0000

Since sizeof(float) is 4, only every fourth element of V is 16-byte aligned. Remember that _mm_load_ps loads four floats at a time. Its argument, the pointer to the first of those floats, must be aligned to 16 bytes. If &V[i] is aligned, then &V[i-1] lies 4 bytes before a 16-byte boundary, which is why that load faults.

I assume that in your example i is always a multiple of four; otherwise _mm_load_ps(&V[i]) would fail as well.
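The alignment rule above can be checked directly on the addresses. The following is a minimal sketch, assuming a C11 toolchain that provides aligned_alloc; the helper names is_aligned16, alloc_floats16 and count_aligned are hypothetical and not part of the question's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p sits on a 16-byte boundary, as _mm_load_ps requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Hypothetical helper: allocate room for n floats on a 16-byte boundary.
   The byte count is rounded up to a multiple of 16 for aligned_alloc. */
static float *alloc_floats16(size_t n)
{
    size_t bytes = n * sizeof(float);
    bytes = (bytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, bytes);
}

/* Counts how many of the first n elements are valid _mm_load_ps targets. */
static size_t count_aligned(const float *V, size_t n)
{
    size_t count = 0;
    assert(is_aligned16(V)); /* the base pointer itself must be aligned */
    for (size_t i = 0; i < n; i++) {
        if (is_aligned16(&V[i]))
            count++;
    }
    return count;
}
```

With an aligned base pointer, only the elements at indices 0, 4, 8, ... pass the check, so &V[i-1] never does when i is a multiple of four.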

Update

Here's how I would suggest implementing the stencil example above using only aligned loads and shuffles:

  __m128 v_im1;
  __m128 v_i = _mm_load_ps( &V[0] );
  __m128 v_ip1 = _mm_load_ps( &V[4] );

  /* Stop at dx-4, as in the question, so that &V[i+4] stays inside the array. */
  for ( i = 4 ; i < dx-4 ; i += 4 ) {

      /* Get the three vectors in this 'frame'. */
      v_im1 = v_i; v_i = v_ip1; v_ip1 = _mm_load_ps( &V[i+4] );

      /* Get the u1..u8 from the example code.
         Each two-bit field of the selector is a lane index from 0 to 3. */
      __m128 u3 = _mm_shuffle_ps( v_im1 , v_i , 2 + (3<<2) + (0<<4) + (1<<6) );
      __m128 u4 = _mm_shuffle_ps( v_i , v_ip1 , 2 + (3<<2) + (0<<4) + (1<<6) );

      __m128 u1 = _mm_shuffle_ps( u3 , v_i , 1 + (2<<2) + (1<<4) + (2<<6) );
      __m128 u2 = _mm_shuffle_ps( v_i , u4 , 1 + (2<<2) + (1<<4) + (2<<6) );

      __m128 u5 = _mm_shuffle_ps( v_im1 , u3 , 1 + (2<<2) + (1<<4) + (2<<6) );
      __m128 u6 = _mm_shuffle_ps( u4 , v_ip1 , 1 + (2<<2) + (1<<4) + (2<<6) );

      __m128 u7 = v_im1;
      __m128 u8 = v_ip1;

      /* Do your computation and store. */
      ...

  }

Note that this is a bit convoluted because _mm_shuffle_ps can only take two values from each argument, so we first have to build u3 and u4 and then reuse them to form the other vectors with different overlaps.
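To make the "two values from each argument" rule concrete, here is a small sketch of the two lane patterns this kind of stencil needs. The helper names take_hi_lo, take_mid and check_selector_encoding are hypothetical; _MM_SHUFFLE is the standard macro from xmmintrin.h, and each of its four fields is a lane index from 0 to 3.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Result = { a[2], a[3], b[0], b[1] }: the upper half of a followed by
   the lower half of b. The two low selector fields index into a,
   the two high fields index into b. */
static __m128 take_hi_lo(__m128 a, __m128 b)
{
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(1, 0, 3, 2));
}

/* Result = { a[1], a[2], b[1], b[2] }: the middle two lanes of each. */
static __m128 take_mid(__m128 a, __m128 b)
{
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 1, 2, 1));
}

/* _MM_SHUFFLE(z, y, x, w) packs (z<<6) | (y<<4) | (x<<2) | w, so the
   macro and the hand-written arithmetic form describe the same byte. */
static void check_selector_encoding(void)
{
    assert(_MM_SHUFFLE(1, 0, 3, 2) == 2 + (3 << 2) + (0 << 4) + (1 << 6));
    assert(_MM_SHUFFLE(2, 1, 2, 1) == 1 + (2 << 2) + (1 << 4) + (2 << 6));
}
```

If a holds V[i-4..i-1] and b holds V[i..i+3], take_hi_lo(a, b) yields V[i-2..i+1]. No single shuffle can yield V[i-1..i+2] from those two inputs, because that would need one lane from a and three from b.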

Note that u1, u3 and u5 could also be recovered from u6, u4 and u2, respectively, of the previous iteration.

Finally, please note that I have not tested the code above! Read the documentation for _mm_shuffle_ps and verify that the third argument, the selector, is correct in each case.
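One way to carry out that verification is to run a shuffle-based kernel next to the scalar loop and compare the results. The sketch below is only an illustration under stated assumptions: V and U are 16-byte aligned, dx is a multiple of 4 and at least 8, and the coefficients are passed as an array for brevity. The function names stencil_scalar and stencil_aligned are hypothetical.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Scalar reference with the same arithmetic as FDTD_base in the question. */
static void stencil_scalar(const float *V, float *U, int dx, const float c[5])
{
    for (int i = 4; i < dx - 4; i++) {
        U[i] = c[0] * V[i]
             + c[1] * (V[i - 1] + V[i + 1])
             + c[2] * (V[i - 2] + V[i + 2])
             + c[3] * (V[i - 3] + V[i + 3])
             + c[4] * (V[i - 4] + V[i + 4]);
    }
}

/* Same stencil using only aligned loads, shuffles and an aligned store. */
static void stencil_aligned(const float *V, float *U, int dx, const float c[5])
{
    assert(dx % 4 == 0 && dx >= 8);

    const __m128 c0 = _mm_set1_ps(c[0]);
    const __m128 c1 = _mm_set1_ps(c[1]);
    const __m128 c2 = _mm_set1_ps(c[2]);
    const __m128 c3 = _mm_set1_ps(c[3]);
    const __m128 c4 = _mm_set1_ps(c[4]);

    __m128 v_im1;
    __m128 v_i   = _mm_load_ps(&V[0]);
    __m128 v_ip1 = _mm_load_ps(&V[4]);

    for (int i = 4; i < dx - 4; i += 4) {
        /* Slide the three-vector frame: V[i-4..i-1], V[i..i+3], V[i+4..i+7]. */
        v_im1 = v_i;
        v_i   = v_ip1;
        v_ip1 = _mm_load_ps(&V[i + 4]);

        /* Offsets of -2 and +2: upper half of one vector, lower half of the next. */
        __m128 u3 = _mm_shuffle_ps(v_im1, v_i,   _MM_SHUFFLE(1, 0, 3, 2));
        __m128 u4 = _mm_shuffle_ps(v_i,   v_ip1, _MM_SHUFFLE(1, 0, 3, 2));

        /* Odd offsets: middle two lanes of each argument. */
        __m128 u1 = _mm_shuffle_ps(u3,    v_i,   _MM_SHUFFLE(2, 1, 2, 1)); /* -1 */
        __m128 u2 = _mm_shuffle_ps(v_i,   u4,    _MM_SHUFFLE(2, 1, 2, 1)); /* +1 */
        __m128 u5 = _mm_shuffle_ps(v_im1, u3,    _MM_SHUFFLE(2, 1, 2, 1)); /* -3 */
        __m128 u6 = _mm_shuffle_ps(u4,    v_ip1, _MM_SHUFFLE(2, 1, 2, 1)); /* +3 */

        /* Accumulate in the same left-to-right order as the scalar loop. */
        __m128 sum = _mm_mul_ps(v_i, c0);
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_add_ps(u1, u2), c1));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_add_ps(u3, u4), c2));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_add_ps(u5, u6), c3));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_add_ps(v_im1, v_ip1), c4));

        _mm_store_ps(&U[i], sum);
    }
}
```

Test data that is not linear in the index is the safer choice here: with V[k] = k the stencil is symmetric enough that a swapped or mis-shifted vector can go unnoticed. Small integer values also keep every intermediate result exact, so the two outputs can be compared with ==.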
