Using SSE Instructions

I have a loop written in C++ that runs over each element of a large integer array. Inside the loop I mask some bits of each integer and then find the min and max values. I have heard that if I use SSE instructions for these operations it will run much faster than a regular loop written with bitwise AND and if-else conditions. My question is: should I go for these SSE instructions? Also, what happens if my code runs on a different processor? Will it still work, or are these instructions processor-dependent?
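For concreteness, a minimal version of the kind of loop I mean (the mask value here is just a placeholder):

#include <climits>
#include <cstddef>

// Plain scalar version: mask some bits, then track min and max.
void scan(const int* data, std::size_t n, int& lo, int& hi)
{
    const int MASK = 0x0FFF;        // placeholder bit mask
    lo = INT_MAX;
    hi = INT_MIN;
    for (std::size_t i = 0; i < n; ++i)
    {
        int v = data[i] & MASK;     // mask some bits
        if (v < lo) lo = v;         // the if-else min/max in question
        if (v > hi) hi = v;
    }
}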

+25
c++ optimization assembly sse processor
Feb 25 '09 at 15:55
15 answers
  • SSE instructions are processor-specific. You can look up which processors support which SSE version on Wikipedia.
  • Whether SSE code will be faster depends on many factors. First of all, is the problem memory-bound or CPU-bound? If the memory bus is the bottleneck, SSE will not help much. Try simplifying your integer calculations: if that speeds up the code, it is probably CPU-bound, and you have a good chance of speeding it up.
  • Keep in mind that writing SIMD code is much harder than writing C++ code, and that the resulting code is much harder to change. Always keep the C++ code up to date; you want it as a comment, and as a check on the correctness of your assembler code (a sketch of that pattern follows this list).
  • Consider using a library such as Intel's IPP, which implements common low-level SIMD operations optimized for various processors.
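A minimal, self-contained sketch of that keep-the-reference-version pattern; the function names are my own, and the "fast" version here is just a stand-in for a real SIMD implementation:

#include <cassert>
#include <cstddef>

// Reference C++ version: serves as documentation and as a
// correctness oracle for the optimized version.
static int sum_masked_ref(const int* a, std::size_t n, int mask)
{
    int s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i] & mask;
    return s;
}

// Stand-in for the hand-written SIMD version (scalar here so the
// sketch stays self-contained).
static int sum_masked_fast(const int* a, std::size_t n, int mask)
{
    return sum_masked_ref(a, n, mask);
}

int main()
{
    int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    // In debug builds, every optimized result is checked against
    // the reference implementation.
    assert(sum_masked_fast(data, 8, 0x7) == sum_masked_ref(data, 8, 0x7));
    return 0;
}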
+23
Feb 25 '09 at 16:09

SIMD, of which SSE is an example, lets you perform the same operation on multiple pieces of data at once. So you will not get any benefit from using SSE as a direct replacement for individual integer operations; you only benefit if you can perform the operation on several data items at the same time. This means loading some data values that are contiguous in memory, doing the required processing on all of them, and then moving on to the next set of values in the array.

Problems:

1. If the code path depends on the data being processed, SIMD becomes much harder to implement. For example, this:

a = array[index];
a &= mask;
a >>= shift;
if (a < somevalue)
{
    a += 2;
    array[index] = a;
}
++index;

is not as easy to do in SIMD:

a1 = array[index]
a2 = array[index+1]
a3 = array[index+2]
a4 = array[index+3]
a1 &= mask
a2 &= mask
a3 &= mask
a4 &= mask
a1 >>= shift
a2 >>= shift
a3 >>= shift
a4 >>= shift
if (a1 < somevalue)   // help! can't conditionally perform this on each
if (a2 < somevalue)   // column; all columns must do the same thing
if (a3 < somevalue)
if (a4 < somevalue)
index += 4

(A sketch after this list shows how SSE expresses such a conditional with compare masks instead of branches.)

2. If the data is not contiguous, getting it loaded into the SIMD registers is cumbersome.

3. The code is processor-specific. SSE exists only on IA-32 (Intel and AMD), and not all IA-32 processors support SSE.
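As promised above, a sketch (my own, with mask, shift, and somevalue as placeholders) of how SSE2 expresses the conditional from problem 1 branchlessly, using compare masks:

#include <emmintrin.h>  // SSE2 intrinsics

// Process four ints at once:
//   v = (array[i] & mask) >> shift; if (v < somevalue) v += 2;
void process4(int* array, int index, int mask, int shift, int somevalue)
{
    __m128i v     = _mm_loadu_si128((__m128i*)&array[index]);
    __m128i m     = _mm_set1_epi32(mask);
    __m128i limit = _mm_set1_epi32(somevalue);

    v = _mm_and_si128(v, m);                         // v &= mask, all lanes
    v = _mm_srl_epi32(v, _mm_cvtsi32_si128(shift));  // v >>= shift (logical)

    // Lanes where v < somevalue become all ones, the rest all zeros,
    // so (lt & 2) is 2 or 0 per lane: every column does the same thing.
    __m128i lt = _mm_cmplt_epi32(v, limit);
    v = _mm_add_epi32(v, _mm_and_si128(lt, _mm_set1_epi32(2)));

    _mm_storeu_si128((__m128i*)&array[index], v);
}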

You need to analyze your algorithm and your data to see whether they can be SSE'd, and that requires knowing how SSE works. There is plenty of documentation on Intel's website.

+14
Feb 25 '09 at 16:24

This problem is a perfect example of where a good low-level profiler is needed (something like VTune). It can give you a much better-informed picture of where your hot spots really are.

My guess, from what you describe, is that your hot spot is probably branch-misprediction failures resulting from the min/max calculations using if-else. If you move to SIMD intrinsics you should use their min/max instructions; before that, though, it might be worth trying a branchless min/max calculation instead. That can get you most of the win with less pain.

Something like this:

inline int minimum(int a, int b)
{
    // All ones when a < b, all zeros otherwise. Note this relies on an
    // arithmetic right shift of a negative value (implementation-defined
    // in C++) and on a - b not overflowing.
    int mask = (a - b) >> 31;
    return (a & mask) | (b & ~mask);
}
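On the SIMD side, SSE2 only provides a packed integer min for 16-bit lanes (_mm_min_epi16); a packed 32-bit min (_mm_min_epi32) does not appear until SSE4.1. On plain SSE2 you can build it from a compare and masks, mirroring the scalar trick above. A sketch:

#include <emmintrin.h>  // SSE2 intrinsics

// Packed 32-bit signed min, the vector analogue of minimum() above.
inline __m128i min_epi32_sse2(__m128i a, __m128i b)
{
    __m128i lt = _mm_cmplt_epi32(a, b);           // all ones where a < b
    return _mm_or_si128(_mm_and_si128(lt, a),     // take a where a < b
                        _mm_andnot_si128(lt, b)); // take b elsewhere
}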
+10
Feb 26 '09 at 16:19

If you use SSE instructions, you are obviously limited to processors that support them. That means x86 starting with the Pentium III or so (if I remember right, that is when they were introduced, and it was a long time ago).

SSE2, which as I recall is the version that adds the integer operations, is somewhat more recent (the Pentium 4; the first AMD Athlon processors did not support it).

In any case, you have two options for using these instructions. One is to write the entire block of code in assembly (probably a bad idea: it makes it almost impossible for the compiler to optimize your code, and it is very hard for a human to write efficient assembler).

Alternatively, use the intrinsics available with your compiler (if memory serves, they are usually defined in xmmintrin.h).

But again, performance may not improve. SSE code puts additional requirements on the data it processes. Mainly, keep in mind that the data must be aligned on 128-bit boundaries. There should also be few or no dependencies between the values loaded into the same register (a 128-bit SSE register can hold 4 ints; adding the first and the second of them together is not optimal, but adding all four ints to the corresponding 4 ints in another register will be fast).
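A sketch of the alignment point (alignas is C++11; period code would use compiler-specific attributes or _mm_malloc instead):

#include <emmintrin.h>  // SSE2 intrinsics

int main()
{
    // 16-byte (128-bit) alignment permits the fast aligned load/store.
    alignas(16) int data[4] = {1, 2, 3, 4};

    __m128i v = _mm_load_si128((const __m128i*)data); // aligned load
    v = _mm_add_epi32(v, v);                          // lane-wise: {2,4,6,8}
    _mm_store_si128((__m128i*)data, v);               // aligned store
    // _mm_loadu_si128 / _mm_storeu_si128 would be needed (and were
    // slower on hardware of that era) if data were not aligned.
    return data[0];
}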

It may be tempting to use a library that wraps up all the low-level SSE fiddling, but that can also squander any potential benefit.

I don't know how well integer operations are supported in SSE, so that may also be a factor limiting performance; SSE is mainly aimed at speeding up floating-point operations.

+6
Feb 25 '09 at 16:15

If you intend to use Microsoft Visual C++, you should read this:

http://www.codeproject.com/KB/recipes/sseintro.aspx

+4
Feb 25 '09 at 16:19

We implemented some image-processing code, similar to what you describe but on byte arrays, in SSE. The speedup compared to C code was substantial: depending on the exact algorithm, more than a factor of 4, even relative to the Intel compiler. But as you already mentioned, you have the following drawbacks:

  • Portability. The code will run on any x86 processor, whether Intel or AMD, but not on other architectures. That is not a problem for us, because we control the target hardware. Risks can come from switching compilers, and even from moving to a 64-bit OS.

  • A steep learning curve. But I found that once you understand the principles, writing new algorithms is not that hard.

  • Maintainability. Most C or C++ programmers do not know assembly/SSE.

My advice: go for it only if you really need the performance improvement, you cannot find a function for your problem in a library such as Intel IPP, and you can live with the portability issues.

+3
Feb 25 '09 at 16:16

I can say from my experience that SSE brings a huge (4x and up) speedup over plain C code (no inline asm, no intrinsics used), but hand-optimized assembler can beat compiler-generated assembly when the compiler cannot figure out what the programmer intended (believe me, compilers do not cover all possible code combinations, and they never will). Oh, and the compiler cannot always lay the data out to run at the highest possible speed, either. But you need a great deal of experience to beat an Intel compiler (if that is possible at all).

+3
Jul 10 '10 at 22:11

SSE instructions were originally Intel-only, but for some time now (since the Athlon XP?) AMD has supported them too, so if you build against the SSE instruction set you should be portable to most x86 processors.

That said, you should not spend time learning SSE coding unless you are already familiar with x86 assembler; the easier route is to check your compiler's documentation for options that let the compiler generate SSE code for you automatically. Some compilers vectorize loops this way very well. (You are probably not surprised to hear that Intel's compilers do a good job of this.)
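A sketch of the kind of loop an auto-vectorizer handles easily; the flags in the comment are GCC's (other compilers have their own equivalents):

#include <cstddef>

// Build with, e.g.:  g++ -O2 -ftree-vectorize -msse2 ...
// A loop with no branches and no cross-iteration dependencies is the
// easiest case for the vectorizer: one operation over every element.
void mask_all(int* out, const int* in, std::size_t n, int mask)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] & mask;
}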

+2
Feb 25 '09 at 16:12

Write your code in a way that helps the compiler understand what you are doing. GCC will then understand and optimize SSE code like this:

typedef union Vector4f
{
    // Easy constructor, defaults to a black/0 vector
    Vector4f(float a = 0, float b = 0, float c = 0, float d = 1.0f)
        : X(a), Y(b), Z(c), W(d)
    {
    }

    // Cast operator, for []
    inline operator float* ()
    {
        return (float*)this;
    }
    // Const cast operator, for const []
    inline operator const float* () const
    {
        return (const float*)this;
    }

    inline Vector4f operator += (const Vector4f& v)
    {
        for (int i = 0; i < 4; ++i)
            (*this)[i] += v[i];
        return *this;
    }
    inline Vector4f operator += (float t)
    {
        for (int i = 0; i < 4; ++i)
            (*this)[i] += t;
        return *this;
    }

    // Vertex / Vector
    // Lower-case xyzw components
    struct {
        float x, y, z;
        float w;
    };
    // Upper-case XYZW components
    struct {
        float X, Y, Z;
        float W;
    };
} Vector4f;

Just remember to have -msse -msse2 in your build options!
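A possible usage sketch (my addition; note the anonymous structs in the union above rely on a compiler extension that GCC accepts):

int main()
{
    Vector4f a(1.0f, 2.0f, 3.0f, 4.0f);
    Vector4f b(0.5f, 0.5f, 0.5f, 0.5f);
    a += b;           // with -O2 -msse -msse2, GCC can emit addps here
    return (int)a.X;  // 1
}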

+2
Feb 27 '09 at 8:44

While it is true that SSE is specific to some processors (SSE may be relatively safe to assume, SSE2 much less so, in my experience), you can detect the CPU at runtime and load the appropriate code dynamically depending on the target CPU.
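A sketch of that dispatch pattern. __builtin_cpu_supports is a GCC/Clang builtin added long after this answer was written (the CPUID approach in a later answer is the era-appropriate route), and the SSE2 "fast path" here is just a scalar stand-in:

#include <cstddef>

static int sum_scalar(const int* a, std::size_t n)
{
    int s = 0;
    for (std::size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

// Stand-in for a hand-vectorized SSE2 version.
static int sum_sse2(const int* a, std::size_t n)
{
    return sum_scalar(a, n);
}

int main()
{
    // Pick the implementation once, based on runtime CPU detection.
    int (*sum_impl)(const int*, std::size_t) =
        __builtin_cpu_supports("sse2") ? sum_sse2 : sum_scalar;

    int data[4] = {1, 2, 3, 4};
    return sum_impl(data, 4);
}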

+1
Feb 25 '09 at 16:31

SIMD instruction sets (such as SSE2) can speed up this sort of thing, but using them properly takes experience. They are very sensitive to alignment and pipeline stalls; careless use can make performance worse than it was without them. You will get a much easier and faster speedup simply by prefetching data into the cache, so that everything you touch is in L1 by the time you operate on it.
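A sketch of software prefetching with the SSE hint intrinsic; the distance of 16 elements ahead is an arbitrary placeholder that would need tuning:

#include <xmmintrin.h>  // _mm_prefetch
#include <cstddef>

int sum_with_prefetch(const int* a, std::size_t n)
{
    int s = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        // Ask the CPU to pull a future cache line toward L1 while we
        // work on the current element.
        if (i + 16 < n)
            _mm_prefetch((const char*)&a[i + 16], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}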

If your function does not need a throughput of more than 100,000,000 integers per second, SIMD is probably not worth the trouble for you.

+1
Feb 26 '09 at 8:43

I will just add briefly to what has been said about different SSE versions being available on different processors: you can check which ones are supported by looking at the corresponding feature flags returned by the CPUID instruction (see, for example, Intel's documentation for details).
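A sketch of that check using GCC's <cpuid.h> helper (MSVC has __cpuid in <intrin.h> instead); the bit positions are the documented CPUID leaf 1 feature flags:

#include <cpuid.h>   // GCC/Clang; MSVC uses __cpuid from <intrin.h>
#include <cstdio>

int main()
{
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    {
        // CPUID leaf 1, EDX register feature flags:
        bool sse  = edx & (1u << 25);   // bit 25: SSE
        bool sse2 = edx & (1u << 26);   // bit 26: SSE2
        std::printf("SSE: %d, SSE2: %d\n", (int)sse, (int)sse2);
    }
    return 0;
}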

+1
Feb 26 '09 at 11:49

Take a look at inline assembler for C/C++; here is a DDJ article. Unless you are 100% sure your program will only ever run on a compatible platform, you should follow the recommendations many people have given here.

+1
Feb 26 '09 at 12:01

I agree with the previous posters. The benefits can be quite large, but getting them can take a lot of work; the Intel documentation for these instructions runs over 4,000 pages. You may want to check out EasySSE (a C++ wrapper library over the intrinsics, with examples) from Ocali Inc.

I suppose my affiliation with EasySSE is obvious.

+1
Nov 29 '11 at 20:07

I do not recommend doing this yourself unless you are reasonably proficient with assembly. Using SSE will most likely require a careful reorganization of your data, as Skizz points out, and the benefit is often questionable at best.

You would probably be much better off writing very small loops, organizing your data very carefully, and just relying on the compiler to do it for you. Both the Intel C Compiler and GCC (since 4.1) can auto-vectorize your code, and they will probably do a better job than you would. (Just add -ftree-vectorize to your CXXFLAGS.)

Edit: One more thing I should mention: several compilers support assembler intrinsics, which are probably easier to use than the asm() or __asm{} syntax.

0
Feb 25 '09 at 18:01


