There is a chance to do this faster than the lookup table, if arithmetic calculations for some reason are faster than memory access. This may be possible if the calculations are vectorized (PPC AltiVec or Intel SSE) and / or if other parts of the program need to use every bit of the cache memory.
If the expansion coefficient = 3, only 7 commands are needed:
out = (((in * 0x101 & 0x0F00F) * 0x11 & 0x0C30C3) * 5 & 0x249249) * 7;
Or another alternative, with 10 instructions:
out = (in | in << 8) & 0x0F00F; out = (out | out << 4) & 0x0C30C3; out = (out | out << 2) & 0x249249; out *= 7;
For other expansion coefficients> = 3:
unsigned mask = 0x0FF; unsigned out = in; for (scale = 4; scale != 0; scale /= 2) { shift = scale * (N - 1); mask &= ~(mask << scale); mask |= mask << (scale * N); out = out * ((1 << shift) + 1) & mask; } out *= (1 << N) - 1;
Or another alternative, for expansion coefficients> = 2:
unsigned mask = 0x0FF; unsigned out = in; for (scale = 4; scale != 0; scale /= 2) { shift = scale * (N - 1); mask &= ~(mask << scale); mask |= mask << (scale * N); out = (out | out << shift) & mask; } out *= (1 << N) - 1;
shift and mask values ββare better calculated before processing the bitstream.
source share