Where SSE instructions outperform regular instructions

Question


Where are x86-64 SSE instructions (vector instructions) superior to regular instructions? What I see is that the frequent loads and stores needed to execute the SSE instructions cancel out any gain from the vector computation. Can anyone give me an example of SSE code that performs better than regular code?

Perhaps because I pass each parameter separately, for example ...

    __m128i a = _mm_set_epi32(pa[0], pa[1], pa[2], pa[3]);
    __m128i b = _mm_set_epi32(pb[0], pb[1], pb[2], pb[3]);
    __m128i res = _mm_add_epi32(a, b);

    for( i = 0; i < 4; i++ )
        po[i] = res.m128i_i32[i];

Is there no way to pass all 4 integers in one go, i.e. transfer the whole 128 bits of pa at once? And assign res.m128i_i32 to po in one go?

+6

c x86-64 sse

pythonic Apr 25 '12 at 10:01


1 answer

Mysticial · Accepted Answer · 2012-04-25T10:48:12+0000

Summarizing the comments into an answer:

Basically, you are falling into the same trap that catches most first-timers. There are two problems in your example:

You are misusing _mm_set_epi32().
You have a very low computation/load-store ratio (1 to 3 in your example).

_mm_set_epi32() is a very expensive intrinsic. Although it is convenient to use, it does not compile to a single instruction. Some compilers (e.g. VS2010) can generate very poorly performing code when _mm_set_epi32() is used.

Instead, since you are loading contiguous blocks of memory, you should use _mm_load_si128(). This requires the pointer to be aligned to 16 bytes. If you cannot guarantee that alignment, you can use _mm_loadu_si128(), but at a performance penalty. Ideally, you should align your data properly so that you do not need to resort to _mm_loadu_si128().
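As a rough sketch of what that could look like for the code in the question (the function names add4_i32 and add4_i32_aligned are made up for illustration, and the stores use _mm_storeu_si128() / _mm_store_si128(), the store counterparts of the loads named above):

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper: po[i] = pa[i] + pb[i] for i = 0..3.
   Works for any alignment of pa, pb and po. */
static void add4_i32(const int *pa, const int *pb, int *po)
{
    __m128i a = _mm_loadu_si128((const __m128i *)pa);  /* one 128-bit load */
    __m128i b = _mm_loadu_si128((const __m128i *)pb);
    _mm_storeu_si128((__m128i *)po, _mm_add_epi32(a, b));  /* one 128-bit store */
}

/* Same operation, but every pointer must be 16-byte aligned;
   passing a misaligned pointer here will fault. */
static void add4_i32_aligned(const int *pa, const int *pb, int *po)
{
    __m128i a = _mm_load_si128((const __m128i *)pa);
    __m128i b = _mm_load_si128((const __m128i *)pb);
    _mm_store_si128((__m128i *)po, _mm_add_epi32(a, b));
}
```

A load also keeps the elements in memory order, so po[0] ends up as pa[0] + pb[0]. _mm_set_epi32() takes its arguments from the highest lane down to the lowest, so the version in the question stores the sums in reverse order.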

To be truly efficient with SSE, you will also want to maximize the computation/load-store ratio. The target I shoot for is about 3 to 4 arithmetic instructions per memory access. That is a fairly high ratio. Typically you have to refactor the code or redesign the algorithm to increase it. Combining passes over the data is a common approach.
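To illustrate combining passes, here is a minimal sketch with an invented workload: scale an array, then add an offset. The function names are hypothetical, and n is assumed to be a multiple of 4. The two-pass version does one arithmetic instruction per load and store pair in each loop; the fused version does two arithmetic instructions for the same pair and touches memory half as often.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE float intrinsics */

/* Two separate passes over x: x = x * s, then x = x + c.
   Assumes n is a multiple of 4. */
static void scale_offset_two_pass(float *x, size_t n, float s, float c)
{
    const __m128 vs = _mm_set1_ps(s);
    const __m128 vc = _mm_set1_ps(c);
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(x + i, _mm_mul_ps(_mm_loadu_ps(x + i), vs));
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(x + i, _mm_add_ps(_mm_loadu_ps(x + i), vc));
}

/* One fused pass: each vector is loaded once, gets both
   operations while in a register, and is stored once. */
static void scale_offset_fused(float *x, size_t n, float s, float c)
{
    const __m128 vs = _mm_set1_ps(s);
    const __m128 vc = _mm_set1_ps(c);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);
        v = _mm_add_ps(_mm_mul_ps(v, vs), vc);
        _mm_storeu_ps(x + i, v);
    }
}
```

The unaligned load and store are used here only so the sketch works on any buffer; with 16-byte aligned data, _mm_load_ps() and _mm_store_ps() would be the natural choice.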

Loop unrolling is often necessary to maximize performance when you have large loop bodies with long dependency chains.
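A minimal sketch of unrolling a loop to break up a dependency chain, using a sum reduction as an invented example. The function name sum_unrolled and the unroll factor of 4 are illustrative choices, and n is assumed to be a multiple of 16. With a single accumulator every add has to wait for the previous one; with four independent accumulators the adds can overlap in the pipeline.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE float intrinsics */

/* Sums n floats using four independent vector accumulators.
   Assumes n is a multiple of 16 (4 vectors of 4 floats per iteration). */
static float sum_unrolled(const float *x, size_t n)
{
    __m128 s0 = _mm_setzero_ps();
    __m128 s1 = _mm_setzero_ps();
    __m128 s2 = _mm_setzero_ps();
    __m128 s3 = _mm_setzero_ps();

    for (size_t i = 0; i < n; i += 16) {
        /* Four adds per iteration, none depending on another. */
        s0 = _mm_add_ps(s0, _mm_loadu_ps(x + i));
        s1 = _mm_add_ps(s1, _mm_loadu_ps(x + i + 4));
        s2 = _mm_add_ps(s2, _mm_loadu_ps(x + i + 8));
        s3 = _mm_add_ps(s3, _mm_loadu_ps(x + i + 12));
    }

    /* Combine the accumulators, then the four lanes. */
    __m128 s = _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3));
    float lanes[4];
    _mm_storeu_ps(lanes, s);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a plain left-to-right sum. Whether four accumulators is the right number depends on the processor's add latency and throughput, so treat it as a starting point to measure.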

Some examples of SO questions that successfully use SSE to achieve a speedup:

C code loop performance (non-vectorized)
C code loop performance [continued] (vectorized)
How do I achieve the theoretical maximum of 4 FLOPs per cycle? (a contrived example for achieving peak processor performance)
