For maximum speed, you can process 8 samples in parallel, storing 8 bytes packed in one 64-bit battery.
Initialize a lookup table with 256 64-bit entries, which are the original 8 bits extended to bytes. (For example, the entry Lookup[0x17u]
mapped to 0x0000000100010101ul
.)
Counting is done only with
Acc+= Lookup[Byte];
You extract individual counters by matching an array of 8 bytes on 64 bits.
You can accumulate 256 times before overflowing; if you need more, accumulate in blocks of 256, and after the block has been processed, transfer the calculations to larger batteries.
If you need to accumulate no more than 16 times, 32-bit batteries are enough.
source share