Clamp a 16-bit signed value between 0 and 4095 using only bit manipulation (no branching)

I want to clamp the value of a signed short variable between 0 and 4095, after which I take the most significant 8 bits as my final value for use elsewhere. Right now I am doing it as follows:

    short color = /* some external source */;
    /*
     * I get the color value as a 16 bit signed integer from an
     * external source I cannot trust. 16 bits are being used here
     * for higher precision.
     */

    if ( color < 0 ) {
        color = 0;
    }
    else if ( color > 4095 ) {
        color = 4095;
    }

    unsigned char color8bit = 0xFF & (color >> 4);
    /*
     * color8bit is my final value which I would actually use
     * in my application.
     */

Can this be done using only bit manipulation, i.e. without any conditionals? It might help speed things up a bit, as this operation happens thousands of times in the code.

The following will not help, since it does not handle edge cases such as negative values and overflows:

 unsigned char color8bit = 0xFF & (( 0x0FFF & color ) >> 4 ); 

Edit: Adam Rosenfield's answer is the one that takes the right approach, but it is implemented incorrectly. ouah's answer gives correct results, but takes a different approach from the one I originally wanted to find.

Here is what I ended up using:

    const static short min = 0;
    const static short max = 4095;
    color = min ^ (( min ^ color ) & -( min < color ));
    color = max ^ (( color ^ max ) & -( color < max ));
    unsigned char color8bit = 0xFF & (( 0x0FFF & color ) >> 4 );
+4
7 answers
    short color = /* ... */;
    color = ((((!!(color >> 12)) * 0xFFF)) | (!(color >> 12) * color )) & (!(color >> 15) * 0xFFF);
    unsigned char color8bit = 0xFF & (color >> 4);

This assumes a two's complement representation.

This has the advantage of not using any equality or relational operators. There are situations where you want to avoid branches at all costs: in some security applications you do not want attackers to be able to exploit branch prediction. Without branches (particularly on embedded processors) you can make your function run in constant time for all inputs.

Note that x * 0xFFF can be further reduced to (x << 12) - x. The multiplication in (!(color >> 12) * color) can also be optimized further, since the left operand of * is 0 or 1.
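The two reductions mentioned above can be sketched as follows. This is only a sketch under stated assumptions, not the answerer's code: the name `clamp12_shift` is hypothetical, and like the original expression it assumes a 16-bit two's complement `short` and an arithmetic right shift of negative values (which is implementation-defined in C).

```c
#include <stdint.h>

/* Hypothetical helper: clamp a 16-bit value to [0, 4095] without
 * multiplication. Assumes two's complement and that >> on a negative
 * value is an arithmetic shift (implementation-defined in C). */
static int16_t clamp12_shift(int16_t color)
{
    int over   = !!(color >> 12);        /* 1 if color < 0 or color > 4095 */
    int nonneg = !(color >> 15);         /* 1 if color >= 0 */

    int hi   = (over << 12) - over;      /* same as over * 0xFFF */
    int keep = color & -(!over);         /* same as !over * color: AND with 0 or ~0 */
    int top  = (nonneg << 12) - nonneg;  /* same as nonneg * 0xFFF */

    return (int16_t)((hi | keep) & top);
}
```

The `!` operators are kept here as in the original expression; only the multiplications have been replaced.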

EDIT:

To add a little explanation: the expression above simply does the same as the one below, without using the conditional and relational operators:

 y = ((y > 4095 ? 4095 : 0) | (y > 4095 ? 0 : y)) & (y < 0 ? 0 : 4095); 

EDIT2:

As @HotLicks correctly noted in his comment, ! is still a conceptual branch. However, it can also be computed with bitwise operators. For example, !!a can be done with the trivial:

 b = (a >> 15 | a >> 14 | ... | a >> 1 | a) & 1 

and !a can be computed as b ^ 1. I'm sure there is a neat hack to do this more efficiently.
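One commonly used alternative to OR-ing all sixteen shifted copies is to OR the value with its own negation: for any nonzero 16-bit `a`, at least one of `a` and `-a` has the top bit set. A minimal sketch, with hypothetical function names, working on `uint16_t` so that the shift is well defined:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if a != 0, else 0 (i.e. !!a). */
static unsigned is_nonzero16(uint16_t a)
{
    uint16_t neg = (uint16_t)(0u - a);   /* negation modulo 2^16 */
    return (unsigned)((uint16_t)(a | neg) >> 15);
}

/* Hypothetical helper: returns 1 if a == 0, else 0 (i.e. !a). */
static unsigned is_zero16(uint16_t a)
{
    return is_nonzero16(a) ^ 1u;
}
```

To apply this to a signed `short`, the value would first be converted to `uint16_t`, which is well defined in C.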

+2

Yes, see these bit-twiddling hacks:

    short color = ...;
    color = color ^ (color & -(color < 0));  // color = max(color, 0)
    color = 4096 ^ ((color ^ 4096) & -(color < 4096));  // color = min(color, 4096)
    unsigned char color8bit = 0xFF & (color >> 4);

Whether this is actually any faster, I don't know; you would need to profile it. Most modern x86 and x86-64 chips support conditional move (cmov) instructions, which conditionally store a value depending on the EFLAGS status bits, and optimizing compilers will often generate these instructions from ternary expressions like color >= 0 ? color : 0. Those will probably be the fastest, but they will not run on older x86 chips.
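For the range asked about in the question, the upper bound in the second step has to be 4095 rather than 4096: with 4096, any input of 4096 or more is clamped to 4096, and `0xFF & (4096 >> 4)` is 0. A minimal sketch of the same min/max idiom with that bound, next to the ternary form that a compiler may turn into conditional moves (both function names are hypothetical):

```c
/* Hypothetical helper: clamp to [0, 4095] using the
 * "y ^ ((x ^ y) & -(x < y))" min/max idiom. */
static short clamp12_bits(short color)
{
    int c = color;
    c = c ^ (c & -(c < 0));                  /* c = max(c, 0) */
    c = 4095 ^ ((c ^ 4095) & -(c < 4095));   /* c = min(c, 4095) */
    return (short)c;
}

/* The same clamp written with ternary expressions. */
static short clamp12_ternary(short color)
{
    int c = color < 0 ? 0 : color;
    return (short)(c > 4095 ? 4095 : c);
}
```

Whether the relational operators in `clamp12_bits` compile to branch-free code depends on the compiler and target, so inspecting the generated assembly is still worthwhile.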

+7

You can do the following:

    BYTE data[0x10000] = { ..... };
    BYTE byte_color = data[(unsigned short)short_color];

Nowadays a 64 KB table is nothing outrageous and may be acceptable. The number of assembly instructions in this variant will be the absolute minimum compared to the other possible approaches.
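The table contents are elided in the answer. One way to fill such a table at startup is sketched below; the names `clamp_table`, `init_clamp_table` and `color_to_8bit` are hypothetical, and `uint8_t` stands in for `BYTE`. The initialization uses ordinary branches, since it runs only once; the per-pixel lookup is a single indexed load.

```c
#include <stdint.h>

/* Hypothetical table: one output byte for each possible 16-bit input. */
static uint8_t clamp_table[0x10000];

/* Fill the table once, before the first lookup. */
static void init_clamp_table(void)
{
    for (long i = 0; i < 0x10000L; ++i) {
        /* Interpret the index as a signed 16-bit value. */
        long v = (i < 0x8000L) ? i : i - 0x10000L;
        if (v < 0) v = 0;
        if (v > 4095) v = 4095;
        clamp_table[i] = (uint8_t)(v >> 4);   /* top 8 of the 12 bits */
    }
}

/* Clamp and reduce to 8 bits with a single table lookup. */
static uint8_t color_to_8bit(int16_t color)
{
    return clamp_table[(uint16_t)color];
}
```

Calling `init_clamp_table()` once and then `color_to_8bit(color)` per value is the intended usage.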

+5

I assume short is 16 bits.

Remove negative values:

    int16_t mask = -(int16_t)((uint16_t)color >> 15);  // 0xFFFF if -ve, 0 if +ve
    short value = color & ~mask;                       // 0 if -ve, color if +ve

value is now between 0 and 32767 inclusive.

Then you can do something like this to clamp the value:

    mask = (uint16_t)(value - 4096) >> 15;  // 1 if <=4095, 0 if >4095
    --mask;                                 // 0 if <=4095, 0xFFFF if >4095
    value &= ~mask;                         // value if <=4095, 0 if >4095
    value |= mask & 0xFFF;                  // 4095 if >4095, value if <=4095
+2

You can also easily vectorize this using Intel's SSE intrinsics. One 128-bit register holds eight of your shorts, and there are min/max/shift/mask functions that process all of them in parallel. In a loop, the min/max constants can be preloaded into registers. The pshufb instruction (part of SSSE3) will even pack the bytes for you.
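A minimal sketch of that idea for one group of eight values is shown below. It is an illustration rather than the answerer's code: the function name `clamp8_to_bytes` is hypothetical, and it packs with the SSE2 saturating pack (`_mm_packus_epi16`) instead of `pshufb`, so that only SSE2 is required. A scalar fallback is included for targets where the compiler does not define `__SSE2__`.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: clamp eight 16-bit colors to [0, 4095] and
 * store the top 8 of the 12 bits of each one in out[0..7]. */
static void clamp8_to_bytes(const int16_t in[8], uint8_t out[8])
{
#if defined(__SSE2__)
    const __m128i lo = _mm_setzero_si128();
    const __m128i hi = _mm_set1_epi16(4095);

    __m128i v = _mm_loadu_si128((const __m128i *)in);
    v = _mm_max_epi16(v, lo);              /* v = max(v, 0) per lane */
    v = _mm_min_epi16(v, hi);              /* v = min(v, 4095) per lane */
    v = _mm_srli_epi16(v, 4);              /* each lane is now 0..255 */
    v = _mm_packus_epi16(v, v);            /* 16-bit lanes -> bytes */
    _mm_storel_epi64((__m128i *)out, v);   /* store the low 8 bytes */
#else
    /* Scalar fallback for targets without SSE2. */
    for (int i = 0; i < 8; ++i) {
        int c = in[i] < 0 ? 0 : in[i];
        c = c > 4095 ? 4095 : c;
        out[i] = (uint8_t)(c >> 4);
    }
#endif
}
```

In a real loop the two constants would be hoisted out, and two input registers could be packed into one 16-byte store.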

+1

I'm going to leave an answer even though it doesn't directly answer the original question, because in the end I think you will find it much more useful.

I assume that your color comes from a camera or image scanner running at 12 bits, followed by some unspecified processing step that can produce values outside the range 0 to 4095. If that is the case, the values are almost certainly linear. The problem is that displays are gamma corrected, so converting from 12 bits to 8 bits requires a non-linear gamma function rather than a simple right shift. This will be much slower than the clamp operation your question is trying to optimize. If you do not apply a gamma function, the image will appear too dark.

    short color = /* some external source */;
    unsigned char color8bit;
    if (color <= 0)
        color8bit = 0;
    else if (color >= 4095)
        color8bit = 255;
    else
        color8bit = (unsigned char)(255.99 * pow(color / 4095.0, 1/2.2));

At this point you might consider a lookup table, as suggested by Kirill Kobelev.

0

This is somewhat similar to Tom Seddon's answer, but uses a slightly cleaner way to clamp above. Note that both Mr. Seddon's answer and mine avoid the issue with ouah's answer, namely that right-shifting a signed value is implementation-defined behavior and is therefore not guaranteed to work on all architectures.

    #include <inttypes.h>
    #include <iostream>

    int16_t clamp(int16_t value)
    {
        // clampBelow is 0xffff for -ve, 0x0000 for +ve
        int16_t const clampBelow = -static_cast<int16_t>(static_cast<uint16_t>(value) >> 15);

        // value is now clamped below at zero
        value &= ~clampBelow;

        // subtract 4095 so we can do the same trick again
        value -= 4095;

        // clampAbove is 0xffff for -ve, 0x0000 for +ve,
        // ie 0xffff for original value < 4095, 0x0000 for original >= 4095
        int16_t const clampAbove = -static_cast<int16_t>(static_cast<uint16_t>(value) >> 15);

        // adjusted value now clamped above at zero
        value &= clampAbove;

        // and restore to original value.
        value += 4095;
        return value;
    }

    void verify(int16_t value)
    {
        int16_t const clamped = clamp(value);
        int16_t const check = (value < 0 ? 0 : value > 4095 ? 4095 : value);
        if (clamped != check)
        {
            std::cout << "Verification failure for value: " << value
                      << ", clamped: " << clamped
                      << ", check: " << check << std::endl;
        }
    }

    int main()
    {
        for (int16_t i = 0x4000; i != 0x3fff; i++)
        {
            verify(i);
        }
        return 0;
    }

This is a complete test program (OK, so it does not test 0x3fff; sue me. ;)) from which you can extract the clamp() routine for whatever you need.

I have also broken the clamp out into one step per line for clarity. If your compiler has a half-decent optimizer, you can leave it as it is and rely on the compiler to produce the best possible code. If your compiler's optimizer is not that great, then by all means reduce the line count, at the cost of a little readability.

“Never Sacrifice Clarity for Efficiency” - Bob Buckley, Professor of Computer Engineering, U-Warwick, Coventry, England, 1980

The best advice I've ever received. ;)

0

Source: https://habr.com/ru/post/1432821/