Using SSE on floating-point pixels with only 3 color components

I am creating a struct to store a single RGB pixel in an image.

    struct Pixel {
        // color values range from 0.0 to 1.0
        float r, g, b;
    } __attribute__((aligned(16)));

I want to use 128-bit SSE instructions to perform operations such as addition, multiplication, etc., so that I can operate on all 3 color channels at once. The first packed float in my SSE register will be red, then green, then blue, but I'm not sure what will go into the fourth element. I really don't care what is in the extra 32 bits of padding. When I load a pixel into an SSE register, I assume the fourth element will contain either zero or a garbage value. Is this problematic? Should I add a fourth alpha channel, even though I don't really need it? The only way I can see this being a problem is if I were dividing by a pixel and the fourth element were zero, or if I took the square root of a negative value, etc.

1 answer

Integer operations will not have any problems with uninitialized values, since their latency never depends on the data. Floating point is different. Some FPUs slow down on denormals, NaNs, and infinities (in any of the vector elements).

Intel Nehalem and earlier slow down significantly when doing math with denormal inputs / outputs, and on FP underflow / overflow. Sandybridge has a good FPU with fast add / sub for any inputs (according to Agner Fog's instruction tables), but multiply can still slow down.

Add / sub / multiply are fine with zeros, but uninitialized junk could be a problem, since it might happen to represent a NaN or something similar.

Be careful with division, so that you don't divide by zero. This may even raise an FPU exception, depending on how the hardware is configured.
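One way to keep the padding lane out of trouble when dividing is to force the divisor's fourth element to 1.0 first. The following is only a sketch under stated assumptions: the helper name pixel_div is made up for illustration, and it assumes the divisor's padding lane already holds 0.0f (if it holds junk, the OR would not produce 1.0f).

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: divide one pixel by another, lane by lane.
   Assumes lane 3 (the padding) of `den` is 0.0f on entry. */
static inline __m128 pixel_div(__m128 num, __m128 den)
{
    /* 0.0f is all-zero bits, so OR-ing leaves r, g, b untouched and
       turns a zero padding lane into exactly 1.0f. */
    const __m128 one_in_pad = _mm_set_ps(1.0f, 0.0f, 0.0f, 0.0f);
    return _mm_div_ps(num, _mm_or_ps(den, one_in_pad));
}
```

With a zero padding lane in the numerator as well, the fourth element of the result is 0 / 1 = 0, so no division by zero occurs in the unused lane.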

So yes, keeping the unused element zeroed is probably a good idea. Depending on how you generate the pixels in the first place, this can be quite cheap (for example, movd / pinsrd / pinsrd (or insertps) to put three 32-bit elements into a vector, with the initial movd zeroing the high 96 bits).
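A minimal sketch of the zeroed-padding approach with intrinsics, leaving the choice of instructions (movd / insertps or otherwise) to the compiler. The explicit pad member and the helper names pixel_make and pixel_add are assumptions for illustration, not part of the question's code.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Assumed layout: the question's struct with the padding made explicit,
   so the fourth lane has a name and can be kept at 0.0f. */
typedef struct {
    float r, g, b;
    float pad;  /* unused lane, kept at 0.0f */
} __attribute__((aligned(16))) Pixel;

/* Build a vector whose unused fourth lane is a known zero.
   _mm_set_ps takes its arguments from the highest lane to the lowest. */
static inline __m128 pixel_make(float r, float g, float b)
{
    return _mm_set_ps(0.0f, b, g, r);
}

/* Add two stored pixels; 0 + 0 in the padding lane stays 0. */
static inline void pixel_add(Pixel *dst, const Pixel *a, const Pixel *b)
{
    __m128 va = _mm_load_ps((const float *)a);  /* aligned 16-byte load */
    __m128 vb = _mm_load_ps((const float *)b);
    _mm_store_ps((float *)dst, _mm_add_ps(va, vb));
}
```

As long as every pixel is created with a zero in the padding lane, add, sub and multiply keep it at zero, so it only has to be set once.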

A workaround would be to keep a second copy of the blue channel in the 4th element (or whichever channel is most convenient for shuffling). You can load vectors using movsldup (SSE3) / movlps. After movsldup your register will contain { b b r r }. movlps will then reload the low 64 bits, so you will have { b b g r }. (movlpd does the same thing, BTW; the load form of movsd would zero the upper 64 bits instead.) Or, if the shuffle port is less busy than the load ports, do one 16B load and then a shufps. (movsldup on Intel CPUs is a single uop that runs on a load port, even though it has the duplication built in.)
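The single-load-plus-shufps variant could look roughly like this with intrinsics. The function name pixel_load_dup_blue is made up for this sketch, and it assumes the pixel is 16-byte aligned with r, g, b in the first three floats.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: load a 16-byte-aligned pixel and replace whatever
   sits in the padding lane with a second copy of blue. */
static inline __m128 pixel_load_dup_blue(const float *p)
{
    __m128 v = _mm_load_ps(p);  /* { pad, b, g, r }, highest lane first */
    /* Select lanes 0, 1, 2, 2 of v, giving { b, b, g, r }. */
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 1, 0));
}
```

After this, the fourth lane always holds a valid color value, so the contents of the padding in memory no longer matter for the arithmetic that follows.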

Another option is to pack your pixels into 12 bytes, so that a 16B load also picks up one component of the next pixel. Depending on what you are doing, overlapping stores that clobber one element of the next pixel may or may not be OK. Loading the next pixel before storing the current one may work around this for some operations. It is quite easy to be bottlenecked on cache or memory bandwidth, so saving 1/4 of the space at the small cost of occasional loads / stores that cross a cache-line boundary may be worth it.
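A rough sketch of the packed 12-byte layout, loading the next pixel before the overlapping store clobbers its red channel. The function name scale_packed_rgb is hypothetical, and the code assumes the buffer has one extra float after the last pixel so that the final 16-byte access stays inside the allocation.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scale n packed RGB pixels (3 floats = 12 bytes each) in place.
   Assumption: rgb points to at least 3*n + 1 floats. The extra float
   at the end is read and overwritten by the last 16-byte access. */
static void scale_packed_rgb(float *rgb, size_t n, float factor)
{
    if (n == 0)
        return;

    const __m128 f = _mm_set1_ps(factor);
    __m128 cur = _mm_loadu_ps(rgb);  /* r, g, b of pixel 0 + r of pixel 1 */

    for (size_t i = 0; i < n; ++i) {
        float *p = rgb + 3 * i;
        /* Fetch the next pixel first: the store below also writes p[3],
           which is the next pixel's red channel. */
        __m128 next = (i + 1 < n) ? _mm_loadu_ps(p + 3) : cur;
        _mm_storeu_ps(p, _mm_mul_ps(cur, f));  /* unaligned 16-byte store */
        cur = next;
    }
}
```

In this layout the fourth lane always holds the next pixel's red value (or the trailing extra float), so for all but the last pixel no junk enters the arithmetic. The price is unaligned accesses, some of which straddle cache lines.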


Source: https://habr.com/ru/post/1232899/

