Summarizing the comments into an answer:
Basically, you have fallen into the same trap that catches most first-timers. There are two problems in your example:
- You are misusing _mm_set_epi32().
- You have a very low computation/load-store ratio (1 to 3 in your example).
_mm_set_epi32() is a very expensive intrinsic. Although it is convenient to use, it does not compile to a single instruction. Some compilers (e.g. VS2010) can generate very poorly performing code when using _mm_set_epi32().
Instead, since you are loading contiguous blocks of memory, you should use _mm_load_si128(). This requires the pointer to be aligned to 16 bytes. If you cannot guarantee this alignment, you can use _mm_loadu_si128(), but with a performance penalty. Ideally, you should align your data properly so that you do not need to resort to _mm_loadu_si128().
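As a rough sketch of the difference, assuming a simple element-wise add of two int32 arrays (the function names and the "n is a multiple of 4" restriction are my own assumptions for illustration, not taken from your code):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = a[i] + b[i], four lanes per iteration.
   Assumes all three pointers are 16-byte aligned and n is a multiple of 4. */
static void add_i32_aligned(const int32_t *a, const int32_t *b,
                            int32_t *out, size_t n)
{
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* One aligned load per vector, instead of building it with
           _mm_set_epi32(a[i + 3], a[i + 2], a[i + 1], a[i]). */
        __m128i va = _mm_load_si128((const __m128i *)(a + i));
        __m128i vb = _mm_load_si128((const __m128i *)(b + i));
        _mm_store_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
}

/* Same loop without the alignment requirement; n must still be a
   multiple of 4. */
static void add_i32_unaligned(const int32_t *a, const int32_t *b,
                              int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
}
```

Note that this loop still does only one add for every two loads and one store, so it fixes the first problem but not the second.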
To be truly efficient with SSE, you will also want to maximize the computation/load-store ratio. The target I shoot for is 3 to 4 arithmetic instructions per memory access. This is a fairly high ratio. Typically, you have to refactor the code or redesign the algorithm to increase it. Combining passes over the data is a common approach.
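To illustrate combining passes, here is a minimal sketch with a made-up chain of operations (add, xor, shift, subtract). The function name, the operations, and the constants k and m are all hypothetical. Done as four separate passes over the array, each pass would perform one arithmetic instruction per load and store; fused into one pass, the same work is four arithmetic instructions per load/store pair.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fused transform: x[i] = (((x[i] + k) ^ m) << 2) - k.
   Assumes x is 16-byte aligned and n is a multiple of 4. */
static void transform_fused(int32_t *x, size_t n, int32_t k, int32_t m)
{
    assert((uintptr_t)x % 16 == 0);
    /* Broadcasting constants once, outside the loop, is a reasonable
       use of the set-style intrinsics. */
    const __m128i vk = _mm_set1_epi32(k);
    const __m128i vm = _mm_set1_epi32(m);
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_load_si128((const __m128i *)(x + i)); /* 1 load */
        v = _mm_add_epi32(v, vk);   /* 4 arithmetic instructions ... */
        v = _mm_xor_si128(v, vm);
        v = _mm_slli_epi32(v, 2);
        v = _mm_sub_epi32(v, vk);
        _mm_store_si128((__m128i *)(x + i), v);               /* 1 store */
    }
}
```

This only reaches about 2 arithmetic instructions per memory access, so it is still short of the 3 to 4 target, but it shows the direction: keep values in registers across steps rather than writing them out and reloading them.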
Loop unrolling is often necessary to maximize performance when you have large loop bodies with long dependency chains.
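A small sketch of the idea, using a sum reduction as a stand-in (my own example, not from your code). With a single accumulator every add depends on the previous one; unrolling by two with independent accumulators gives the CPU two chains it can overlap. The function name and the "n is a multiple of 8" restriction are assumptions for brevity.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums n int32 values (wrapping on overflow).
   Assumes x is 16-byte aligned and n is a multiple of 8. */
static int32_t sum_i32_unrolled(const int32_t *x, size_t n)
{
    assert((uintptr_t)x % 16 == 0);
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 8) {
        /* Two independent dependency chains per iteration. */
        acc0 = _mm_add_epi32(acc0, _mm_load_si128((const __m128i *)(x + i)));
        acc1 = _mm_add_epi32(acc1, _mm_load_si128((const __m128i *)(x + i + 4)));
    }
    /* Combine the accumulators, then add the four lanes in scalar code. */
    __m128i acc = _mm_add_epi32(acc0, acc1);
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

How far to unroll depends on the instruction latencies and the number of registers available, so treat the factor of two here as a placeholder and measure.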
Some examples of SO questions that successfully use SSE to achieve a speedup.