If you use SSE instructions, you are obviously limited to the processors that support them. That means x86 CPUs going back to the Pentium III or so (SSE was introduced a long time ago).
SSE2, which as I recall is the one that adds integer operations on the 128-bit registers, is somewhat more recent (Pentium 4; the first AMD Athlon processors did not support it).
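Because support depends on the CPU, code that relies on SSE or SSE2 often checks for it at runtime. Below is a minimal sketch, assuming GCC or Clang on x86 or x86-64; `__builtin_cpu_supports` and `__builtin_cpu_init` are compiler builtins rather than standard C, and the helper names `cpu_has_sse` / `cpu_has_sse2` are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical helper: true if the running CPU supports SSE.
   Uses GCC/Clang x86 builtins; not portable to other architectures or compilers. */
bool cpu_has_sse(void) {
    __builtin_cpu_init(); /* initialise the CPU feature data; safe to call repeatedly */
    return __builtin_cpu_supports("sse") != 0;
}

/* Hypothetical helper: true if the running CPU supports SSE2. */
bool cpu_has_sse2(void) {
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
}
```

On x86-64 both checks always succeed, since SSE2 is part of the 64-bit baseline; the check matters mainly for 32-bit x86 builds that must still run on older CPUs.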
In any case, you have two options for using these instructions. Either write the entire block of code in assembly (probably a bad idea: it makes it almost impossible for the compiler to optimize your code, and it's very hard for a human to write efficient assembly).
Or use the intrinsics provided by your compiler (if memory serves, the SSE ones are usually declared in xmmintrin.h, and the SSE2 ones in emmintrin.h).
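For a rough idea of what the intrinsics route looks like, here is a minimal sketch that adds four floats to four floats with one SSE instruction. It assumes a compiler that ships xmmintrin.h (GCC, Clang and MSVC do); the function name `add4_floats` is hypothetical. The compiler still handles register allocation and scheduling, which is the advantage over hand-written assembly.

```c
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, ... */

/* Hypothetical example: out[i] = a[i] + b[i] for i = 0..3.
   Unaligned loads/stores are used so the arrays need no special alignment. */
void add4_floats(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);      /* load a[0..3] into one register */
    __m128 vb = _mm_loadu_ps(b);      /* load b[0..3] */
    __m128 vsum = _mm_add_ps(va, vb); /* four additions in one instruction (addps) */
    _mm_storeu_ps(out, vsum);         /* write the four sums back */
}
```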
But again, performance may not improve. SSE code places additional requirements on the data it processes. Mainly, keep in mind that the data should be aligned on 128-bit (16-byte) boundaries. There should also be few or no dependencies between values loaded into the same register (a 128-bit SSE register can hold 4 ints. Adding the first and second of them together is not optimal, but adding all four ints to the corresponding 4 ints in another register will be fast).
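A minimal sketch of both points, assuming SSE2 (emmintrin.h) and C11 for `_Alignas`; the function names are hypothetical. The first function does the fast "vertical" add and uses the aligned load/store intrinsics, which fault on addresses that are not 16-byte aligned. The second does the "horizontal" sum of the four ints in one register, where each step depends on the previous one.

```c
#include <stddef.h>
#include <emmintrin.h> /* SSE2 integer intrinsics (also pulls in xmmintrin.h) */

/* Hypothetical example: out[i] = a[i] + b[i].
   Assumes a, b and out are 16-byte aligned and n is a multiple of 4. */
void add_int_arrays(const int *a, const int *b, int *out, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128i va = _mm_load_si128((const __m128i *)(a + i)); /* aligned load */
        __m128i vb = _mm_load_si128((const __m128i *)(b + i));
        _mm_store_si128((__m128i *)(out + i), _mm_add_epi32(va, vb)); /* 4 adds at once */
    }
}

/* Hypothetical example: sum of the four ints inside one register.
   Needs two shuffle/add rounds that depend on each other. */
int hsum_epi32(__m128i v) {
    __m128i t = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)); /* swap 64-bit halves */
    __m128i sum = _mm_add_epi32(v, t);                         /* [v0+v2, v1+v3, v2+v0, v3+v1] */
    t = _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1));       /* swap neighbouring lanes */
    sum = _mm_add_epi32(sum, t);                               /* every lane holds the total */
    return _mm_cvtsi128_si32(sum);                             /* extract lane 0 */
}
```

Arrays for `add_int_arrays` can be declared as `_Alignas(16) int a[8];` or allocated with `aligned_alloc(16, size)`.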
It may be tempting to use a library that wraps all the low-level SSE intrinsics, but that can also ruin any potential benefit.
I don't know how well integer operations are supported by SSE, so that may also be a factor limiting performance. SSE is mainly aimed at speeding up floating-point operations.
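One concrete example of the integer gap: SSE2 has no lane-wise 32-bit integer multiply (`_mm_mullo_epi32` only arrived with SSE4.1), while floats get `_mm_mul_ps` as a single instruction. A minimal sketch of the usual SSE2 workaround, with the hypothetical name `mullo_epi32_sse2`, emulates it with two 32x32 to 64-bit multiplies plus shuffles.

```c
#include <emmintrin.h> /* SSE2 */

/* Hypothetical example: lane-wise 32-bit multiply (low 32 bits of each product)
   using only SSE2. The low 32 bits are the same for signed and unsigned inputs. */
__m128i mullo_epi32_sse2(__m128i a, __m128i b) {
    /* 64-bit products of lanes 0 and 2 */
    __m128i even = _mm_mul_epu32(a, b);
    /* shift lanes 1 and 3 down into positions 0 and 2, then multiply those */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    /* keep the low half of each 64-bit product and interleave back into lane order */
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(odd, _MM_SHUFFLE(0, 0, 2, 0)));
}
```

Five instructions instead of one, which illustrates why integer-heavy code may gain less from SSE2 than float code does.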
jalf Feb 25 '09 at 16:15