When to use parallel counting (MIT HAKMEM) for bit counting, and when is memory a concern?

Bit counting can be done in several ways: for example, with an iterator over the set bits, with an iterator over the unset bits, with pre-computed lookup tables, or with parallel counting. As I found out while searching the Internet, the unset-bit iterator is fast when there are few unset bits, and the set-bit iterator is fast in the opposite case. But when should you use parallel counting, MIT HAKMEM (see below) in particular? It looks pretty fast, although probably slower than lookup tables. Is it always better than the set/unset-bit iterators in terms of speed? Are there any considerations other than speed and memory that affect the choice?

    int BitCount(unsigned int u)
    {
        unsigned int uCount;
        /* Sum the bits of each 3-bit octal digit, then fold adjacent digits
           into 6-bit sums and add them all up by reducing modulo 63.
           The masks are octal constants. */
        uCount = u - ((u >> 1) & 033333333333)
                   - ((u >> 2) & 011111111111);
        return ((uCount + (uCount >> 3)) & 030707070707) % 63;
    }
+6
4 answers

Why choose one bit-counting method over another? Well, it really depends on your machine and the problem you are trying to solve. Note that all of the instruction counts below assume a basic RISC processor and may not translate well to a more complex beast such as x86.

The HAKMEM algorithm you quoted executes in 13 instructions, but is unlikely to be very fast because of the modulo operator. Eyeballing it, it appears to have a fairly good level of instruction-level parallelism, which should help if your processor can exploit it.

Bo Persson's algorithm is quite fast (2 + 5*pop(x) instructions), but only if the word is sparsely populated. It can also be modified to work on densely populated words (one possible adaptation is sketched below). It contains branches and has no significant instruction-level parallelism.
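One way it could be adapted for densely populated words, as a sketch only (the helper names are mine, not from the answer), is to run the same sparse-friendly loop on the complement and subtract the result from the word size:

    /* Kernighan-style loop, fast when few bits are set
       (this is the algorithm shown in Bo Persson's answer below). */
    static int popcount_sparse(unsigned x)
    {
        int count = 0;
        while (x) {
            count++;
            x &= x - 1;   /* clear the lowest set bit */
        }
        return count;
    }

    /* Hypothetical dense-word variant: count the clear bits instead. */
    static int popcount_dense(unsigned x)
    {
        return 32 - popcount_sparse(~x);   /* assumes 32-bit unsigned */
    }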

EDIT: The table-lookup method can also be very fast, but it does access memory. If the whole table is in the L1 cache, it is probably one of the fastest algorithms; if the table is not in the cache, it is almost certainly one of the slowest.
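For illustration, here is a minimal sketch of the table approach using an 8-bit lookup table (256 entries, so it fits comfortably in L1); the table initializer and function names are mine, not part of the original answer:

    #include <stdint.h>

    /* popcount8[i] holds the number of set bits in the byte value i. */
    static uint8_t popcount8[256];

    static void init_popcount8(void)
    {
        /* popcount8[0] is already 0; popcount(i) = (i & 1) + popcount(i >> 1). */
        for (int i = 1; i < 256; i++)
            popcount8[i] = (uint8_t)((i & 1) + popcount8[i >> 1]);
    }

    static int popcount_lut(uint32_t x)
    {
        return popcount8[x & 0xFF]
             + popcount8[(x >> 8) & 0xFF]
             + popcount8[(x >> 16) & 0xFF]
             + popcount8[x >> 24];
    }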

Below is a variation of one of the HAKMEM algorithms, as presented in the Hacker's Delight book (which I highly recommend if you like this sort of thing). It executes in 19 instructions and is branch-free. It does not use division, but it does use a multiplication. It is also very economical with registers, reusing the same mask as much as possible. There is still no significant instruction-level parallelism here that I can see.

    int pop(unsigned x)
    {
        unsigned n;
        /* Three subtractions leave each nibble of x holding its own bit count. */
        n = (x >> 1) & 0x77777777;  x = x - n;
        n = (n >> 1) & 0x77777777;  x = x - n;
        n = (n >> 1) & 0x77777777;  x = x - n;
        x = (x + (x >> 4)) & 0x0F0F0F0F;   /* sum adjacent nibbles into bytes */
        x = x * 0x01010101;                /* accumulate all bytes into the top byte */
        return x >> 24;
    }

The Hacker's Delight book also presents several even more specialized algorithms, for example for fields 9, 8, or 7 bits wide, or using floating-point operations. Note that most of the analysis I presented above is also partly taken from that book.

The bottom line is that there is a truckload of methods, and the only way to be sure which works best in your particular situation is to measure and compare. I realize this is a fairly non-committal answer, but the alternative is to know your processor and compiler inside out.
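If it helps, one rough way to measure and compare is a micro-benchmark along these lines (a sketch only: swap in whichever variants you want to race, and keep in mind that such loops are easily distorted by compiler optimizations and warm caches):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* The variant under test; BitCount is the HAKMEM routine from the question. */
    extern int BitCount(unsigned int u);

    int main(void)
    {
        enum { N = 1 << 20, REPS = 100 };
        static unsigned data[N];
        for (int i = 0; i < N; i++)
            data[i] = ((unsigned)rand() << 16) ^ (unsigned)rand();

        volatile unsigned long long sink = 0;   /* keeps the work observable */
        clock_t start = clock();
        for (int r = 0; r < REPS; r++)
            for (int i = 0; i < N; i++)
                sink += (unsigned long long)BitCount(data[i]);
        clock_t end = clock();

        printf("checksum = %llu, time = %.3f s\n",
               (unsigned long long)sink, (double)(end - start) / CLOCKS_PER_SEC);
        return 0;
    }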

+5

It is very simple, but surprisingly fast if only a few bits are set:

    typedef unsigned long long U64;   /* 64-bit word type, as used on the Chess Programming Wiki */

    int popCount(U64 x)
    {
        int count = 0;
        while (x) {
            count++;
            x &= x - 1;   /* reset LS1B (clear the least significant set bit) */
        }
        return count;
    }

From the Chess Programming Wiki: Brian Kernighan's Way

+4

If you have an x86 CPU that supports SSE4, you already have a single instruction that does all the work: POPCNT. See the Intel® SSE4 Programming Reference (PDF, 698 KB).
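In practice you would usually reach that instruction through a compiler intrinsic rather than hand-written assembly. A minimal sketch for GCC/Clang (compile with -msse4.2 or -mpopcnt so the intrinsic maps to the POPCNT instruction):

    #include <nmmintrin.h>   /* SSE4.2 intrinsics, including _mm_popcnt_u32 */

    static inline int popcount_hw(unsigned int x)
    {
        /* Emits POPCNT when the target supports it; __builtin_popcount(x)
           is a portable GCC/Clang alternative that typically falls back to
           code similar to the algorithms above on older CPUs. */
        return _mm_popcnt_u32(x);
    }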

One more note: parallel algorithms like HAKMEM 169 contain no conditional branches. This is a huge advantage. Modern processors predict the direction of conditional branches, but have trouble with branches that go a random way, and a misprediction is very expensive (the cost can exceed the equivalent of 100 instructions). It is better to avoid loops with a variable trip count and/or conditional statements if performance matters.

For more information, I also second the recommendation of the book Hacker's Delight.

+1

How about the following:

  • From each raw word `r`, compute the pair-summed word `w = r - ((r >> 1) & 0x55555555)`. Each pair of bits will hold the number of set bits in that pair (0-2).
  • From each pair-summed word, compute the quad-summed word `q = (w & 0x33333333) + ((w >> 2) & 0x33333333)`. Each four-bit quad will hold the number of set bits in that quad (0-4).
  • Add the quad-summed words together in groups of two, and from each sum `s` compute the octet-summed word `o = (s & 0x0F0F0F0F) + ((s >> 4) & 0x0F0F0F0F)`. Each octet will hold the total number of set bits in the corresponding octet of both input words (0-16).
  • Add the octet-summed words together in groups of eight, and from each sum `s` compute the halfword-summed word `h = (s & 0x00FF00FF) + ((s >> 8) & 0x00FF00FF)`. Each halfword will hold the total number of set bits in the corresponding halfword of all 16 input words (0-256).
  • Add the halfword-summed words together in groups of 64, and from each sum `s` compute the total `t = (s & 0x0000FFFF) + ((s >> 16) & 0xFFFF)`. This will hold the number of bits set in 1024 input words (0-32768).

Note that two steps are performed for each word, one step for every two words, one for every sixteen, and one for every 1024. If the words are 64 bits rather than 32, one additional step would be needed for the final total, but the word-wide sums could then be added together in groups of 65,536 (which is 67,108,864 input words).
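Here is one possible reading of that scheme in code for the 32-bit case; the function names, the block size of 1024 words, and the exact grouping are my assumptions about the intent rather than code from the answer:

    #include <stdint.h>
    #include <stddef.h>

    /* Steps 1-2: per-word pair sums, then quad (nibble) sums. */
    static uint32_t quad_sums(uint32_t r)
    {
        uint32_t w = r - ((r >> 1) & 0x55555555u);           /* 0-2 per pair */
        return (w & 0x33333333u) + ((w >> 2) & 0x33333333u); /* 0-4 per quad */
    }

    /* Count the set bits in a block of up to 1024 words. */
    unsigned bulk_popcount(const uint32_t *words, size_t n)
    {
        uint32_t half_sums = 0;                 /* two 16-bit counters */
        size_t i = 0;
        while (i < n) {
            uint32_t octet_sums = 0;            /* four 8-bit counters */
            /* Steps 3-4: quad sums added in pairs, octet sums in groups of eight. */
            for (int g = 0; g < 8 && i < n; ++g) {
                uint32_t s = quad_sums(words[i++]);
                if (i < n)
                    s += quad_sums(words[i++]);
                octet_sums += (s & 0x0F0F0F0Fu) + ((s >> 4) & 0x0F0F0F0Fu);
            }
            half_sums += (octet_sums & 0x00FF00FFu) + ((octet_sums >> 8) & 0x00FF00FFu);
        }
        /* Step 5: fold the two halfword counters into the final total. */
        return (half_sums & 0x0000FFFFu) + (half_sums >> 16);
    }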

0

Source: https://habr.com/ru/post/904241/

