This discussion assumes that your compiler either does not provide a clz operation or does not generate good enough assembly for it. Note that both are unlikely these days, so in practice I would recommend using __builtin_clz on gcc, or the equivalent on your compiler.
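For example, a thin wrapper over the gcc/clang builtin might look like this (note that __builtin_clz is undefined for a zero argument, so that case has to be handled separately; the wrapper name clz32 is my own):

```c
/* __builtin_clz(0) is undefined behavior, so check for zero first. */
static inline int clz32(unsigned x) {
    return x == 0 ? 32 : __builtin_clz(x);
}
```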
Note that determining the "best" clz algorithm is something only you can do. Modern processors are complex beasts, and the performance of any algorithm will depend heavily on the platform you run it on, the data you throw at it, and the code surrounding it. The only way to be sure is to measure, measure, and then measure some more. If you can't tell the difference, then you are probably not looking at your bottleneck and your time would be better spent elsewhere.
With the boring disclaimers out of the way, let's look at what Hacker's Delight has to say about the problem. A quick survey shows that all the algorithms are based on a binary search of some description. Here is a simple example:
    int n = 32;
    unsigned y;

    y = x >> 16; if (y != 0) { n = n - 16; x = y; }
    y = x >> 8;  if (y != 0) { n = n - 8;  x = y; }
    y = x >> 4;  if (y != 0) { n = n - 4;  x = y; }
    y = x >> 2;  if (y != 0) { n = n - 2;  x = y; }
    y = x >> 1;  if (y != 0) return n - 2;
    return n - x;
Note that this works on 32-bit ints and that, if necessary, it can also be converted to an iterative version. Unfortunately, this solution does not expose much instruction-level parallelism and has quite a few branches, which does not make for a very good bit-twiddling algorithm. Note that there is a branch-free version as well, but it is considerably longer, so I will not reproduce it here.
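The iterative conversion mentioned above might be sketched like this (my own rendition, halving the shift amount on each pass instead of unrolling):

```c
/* Iterative form of the binary search above; assumes 32-bit unsigned. */
int nlz_iter(unsigned x) {
    int n = 32;
    unsigned y;
    for (int shift = 16; shift > 1; shift >>= 1) {
        y = x >> shift;
        if (y != 0) { n -= shift; x = y; }
    }
    y = x >> 1;
    if (y != 0) return n - 2;
    return n - (int)x;
}
```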
So, let's improve the solution using the pop instruction (which counts the number of set bits):
    x = x | (x >> 1);
    x = x | (x >> 2);
    x = x | (x >> 4);
    x = x | (x >> 8);
    x = x | (x >> 16);
    return pop(~x);
So how does this work? The key is the pop(~x) call at the end, which counts the number of zero bits in x. For counting zeros to be meaningful, we first need to get rid of all the zeros that are not leading zeros. We do that by smearing the most significant set bit rightwards with doubling shifts, so that every bit below it becomes 1. While we still lack instruction-level parallelism, we have gotten rid of all the branches, and it uses fewer instructions than the previous solution. Much better.
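Putting it together as a complete function might look like the sketch below. The portable SWAR fallback for pop is my own addition so the example is self-contained; in real code you would use the compiler builtin instead:

```c
/* Portable population count (SWAR); real code should prefer the builtin. */
static int pop(unsigned x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (int)((x * 0x01010101u) >> 24);
}

int nlz_pop(unsigned x) {
    /* Smear the highest set bit rightwards, e.g. 0x0A000000 -> 0x0FFFFFFF. */
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    /* Now ~x has exactly one 1-bit per leading zero of the original value. */
    return pop(~x);
}
```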
So what about this pop instruction, isn't that a cheat? Most architectures have a single-cycle population-count instruction, which can be accessed through compiler builtins (for example, gcc's __builtin_popcount). Failing that, there are table-based solutions, but one should be careful when trading cycles for cache accesses, even if the table fits entirely in the L1 cache.
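As an illustration of the table-based approach (a sketch under my own naming, not from the book), a 256-entry byte table trades the bit tricks for four loads, which is exactly the cycles-versus-cache trade-off mentioned above:

```c
/* Per-byte popcount table; pop8[i] = number of set bits in byte i. */
static unsigned char pop8[256];

static void init_pop8(void) {
    for (int i = 1; i < 256; i++)
        pop8[i] = (unsigned char)((i & 1) + pop8[i / 2]);
}

/* Four table lookups instead of arithmetic bit tricks. */
static int pop_table(unsigned x) {
    return pop8[x & 0xFF] + pop8[(x >> 8) & 0xFF]
         + pop8[(x >> 16) & 0xFF] + pop8[x >> 24];
}
```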
Finally, as is usual with enthusiastic hackers, we start wandering into stranger territory. Let's count some leading zeros using floating-point numbers:
    union { unsigned asInt[2]; double asDouble; };
    asDouble = (double)k + 0.5;
    return 1054 - (asInt[LE] >> 20);
First, a small warning: DO NOT USE THIS ALGORITHM. It invokes undefined behavior according to the standard. It is reproduced here for the fun factor more than for any practical application. Use it at your own risk.
Now, with that disclaimer out of the way, how does it work? First, it converts the int to a double, and then it extracts the exponent field from that double. Neat stuff. The magic constant 1054 is 31 plus the IEEE 754 double exponent bias of 1023, and adding 0.5 makes k == 0 land on exponent -1, so it maps to 32. The LE constant must be 1 when running on a little-endian machine and 0 on a big-endian machine.
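For comparison, a version of the same exponent trick that avoids the union type pun might use memcpy, which is well-defined (this rewrite is my own; it still assumes IEEE 754 doubles, and copying the whole 64-bit representation also sidesteps the LE constant):

```c
#include <stdint.h>
#include <string.h>

/* Exponent-extraction trick via memcpy; assumes IEEE 754 binary64. */
int nlz_float(unsigned k) {
    double d = (double)k + 0.5;      /* +0.5 maps k == 0 to exponent -1 */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* defined way to reinterpret bytes */
    /* Top 12 bits are sign (0 here) and biased exponent; 1054 = 31 + 1023. */
    return 1054 - (int)(bits >> 52);
}
```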
This should give you a brief overview of the various bit-twiddling algorithms for this problem. Note that the book contains several more variations that make different trade-offs, but I will let you discover those yourself.