A portable, effective alternative to PDEP without using BMI2?

Question

A portable, effective alternative to PDEP without using BMI2?

The documentation for the parallel deposit instruction ( PDEP ) in the Intel bit mask 2 instruction set (BMI2) describes the following sequential implementation for the instruction (C-like pseudo-code):

 U64 _pdep_u64(U64 val, U64 mask) { U64 res = 0; for (U64 bb = 1; mask; bb += bb) { if (val & bb) res |= mask & -mask; mask &= mask - 1; } return res; }

See also Intel PDEP insn ref manual entry .

This algorithm is O (n), where n is the number of bits in the set in mask , which obviously has the worst case O (k), where k is the total number of bits in mask .

Is a more efficient worst case algorithm possible?

Is it possible to make a faster version, assuming that val has at most one bit set, that is, it is either 0 or 1<<r for some r value from 0 to 63?

+4

assembly x86 algorithm bit-manipulation bmi

markt1964 Aug 14 '16 at 3:39

source share

1 answer

njuffa · Accepted Answer · 2016-08-14T19:34:42+0000

The second part of the question about the special case of a one-bit deposit requires two steps. At the first stage, we need to determine the bit index r single 1-bit one in val , with a suitable answer if val is zero. This can easily be done using the POSIX ffs function, or if r is known in other ways, as indicated by the respondent in the comments. In the second step, we need to identify the bit index i r -th 1-bit in mask , if it exists. Then we can insert the r -th bit of the val bit into bit i .

One way to search for the r -th 1-bit index in mask is to count 1-bits using the classic population based on a binary partition and record all the intermediate intermediate bits in the group. Then we perform a binary search on the recorded bit counter data to identify the position of the desired bit.

The following C code demonstrates this using 64-bit data. Whether this is actually faster than the iterative method will depend very much on the typical values of mask and val .

 #include <stdint.h> /* Find the index of the n-th 1-bit in mask, n >= 0 The index of the least significant bit is 0 Return -1 if there is no such bit */ int find_nth_set_bit (uint64_t mask, int n) { int t, i = n, r = 0; const uint64_t m1 = 0x5555555555555555ULL; // even bits const uint64_t m2 = 0x3333333333333333ULL; // even 2-bit groups const uint64_t m4 = 0x0f0f0f0f0f0f0f0fULL; // even nibbles const uint64_t m8 = 0x00ff00ff00ff00ffULL; // even bytes uint64_t c1 = mask; uint64_t c2 = c1 - ((c1 >> 1) & m1); uint64_t c4 = ((c2 >> 2) & m2) + (c2 & m2); uint64_t c8 = ((c4 >> 4) + c4) & m4; uint64_t c16 = ((c8 >> 8) + c8) & m8; uint64_t c32 = (c16 >> 16) + c16; int c64 = (int)(((c32 >> 32) + c32) & 0x7f); t = (c32 ) & 0x3f; if (i >= t) { r += 32; i -= t; } t = (c16>> r) & 0x1f; if (i >= t) { r += 16; i -= t; } t = (c8 >> r) & 0x0f; if (i >= t) { r += 8; i -= t; } t = (c4 >> r) & 0x07; if (i >= t) { r += 4; i -= t; } t = (c2 >> r) & 0x03; if (i >= t) { r += 2; i -= t; } t = (c1 >> r) & 0x01; if (i >= t) { r += 1; } if (n >= c64) r = -1; return r; } /* val is either zero or has a single 1-bit. Return -1 if val is zero, otherwise the index of the 1-bit The index of the least significant bit is 0 */ int find_bit_index (uint64_t val) { return ffsll (val) - 1; } uint64_t deposit_single_bit (uint64_t val, uint64_t mask) { uint64_t res = (uint64_t)0; int r = find_bit_index (val); if (r >= 0) { int i = find_nth_set_bit (mask, r); if (i >= 0) res = (uint64_t)1 << i; } return res; }

A portable, effective alternative to PDEP without using BMI2?

More articles: