Unaligned access is much slower than aligned access (at least before Nehalem); you can get more speed by loading the aligned 128-bit words that contain the desired unaligned 64-bit words, then shuffling them to produce the desired result.
This assumes:
- you have read access to memory up to the enclosing 128-bit boundaries
- the 64-bit words are aligned to at least 32-bit boundaries
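e.g. (untested)

    /* byte offset of each pointer within its aligned 16-byte block */
    int aoff = (int)((uintptr_t)ptra & 15);
    int boff = (int)((uintptr_t)ptrb & 15);
    /* aligned 128-bit loads of the blocks containing the wanted words */
    __m128 va = _mm_load_ps( (const float*)((const char*)ptra - aoff) );
    __m128 vb = _mm_load_ps( (const float*)((const char*)ptrb - boff) );
    /* one shuffle per combination of offsets */
    switch ( (aoff<<4) | boff )
    {
        case 0:
            _mm_shuffle_ps(va,vb, ...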
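The number of cases depends on whether you can assume 64-bit alignment. With 64-bit alignment each offset is either 0 or 8, which leaves four cases; with only 32-bit alignment each pointer has four possible offsets (0, 4, 8, 12), giving sixteen.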
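For the simpler four-case variant, here is a minimal sketch. It is my own illustration, not the code above: it assumes both pointers are 8-byte aligned (so a word never straddles two 16-byte blocks), it treats the 64-bit words as `double` and swaps in `_mm_load_pd` / `_mm_shuffle_pd` for the `_ps` intrinsics, and `gather2` is a hypothetical helper name.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: returns { *pa, *pb } in the low and high lanes.
 * Assumptions: pa and pb are 8-byte aligned, and the whole aligned
 * 16-byte block around each of them is readable. */
static __m128d gather2(const double *pa, const double *pb)
{
    uintptr_t aoff = (uintptr_t)pa & 15;  /* 0 or 8 under the assumption */
    uintptr_t boff = (uintptr_t)pb & 15;  /* 0 or 8 under the assumption */
    assert((aoff & 7) == 0 && (boff & 7) == 0);

    /* Aligned loads of the 16-byte blocks that contain the two words. */
    __m128d va = _mm_load_pd((const double *)((const char *)pa - aoff));
    __m128d vb = _mm_load_pd((const double *)((const char *)pb - boff));

    /* _mm_shuffle_pd(a, b, imm): bit 0 of imm picks the lane of a for the
     * low result, bit 1 picks the lane of b for the high result. The
     * immediate must be a compile-time constant, hence the switch. */
    switch ((aoff << 4) | boff) {
    case 0x00: return _mm_shuffle_pd(va, vb, 0);  /* low a,  low b  */
    case 0x08: return _mm_shuffle_pd(va, vb, 2);  /* low a,  high b */
    case 0x80: return _mm_shuffle_pd(va, vb, 1);  /* high a, low b  */
    default:   return _mm_shuffle_pd(va, vb, 3);  /* high a, high b */
    }
}
```

Calling `gather2(&x[1], &x[2])` on a 16-byte-aligned array `x` takes the `0x80` case: the high lane of the first block and the low lane of the second. If you can only assume 32-bit alignment, an offset of 12 means the word spans two aligned blocks, so you would need an extra load for that pointer; this sketch does not cover that.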