I assume you're using x86-64 MSVC CL19 (or a compiler that makes similar code).
`_bittest` is slower because MSVC does a terrible job here and keeps the value in memory, and `bt [mem], reg` is much slower than `bt reg, reg`. This is a missed compiler optimization. It happens even if you make `num` a local variable instead of a global, even when the initializer is still a constant!
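A minimal sketch of what that looks like (my repro, not code from the question; assumes MSVC's `<intrin.h>`): even with a local copy whose initializer is a compile-time constant, the value still ends up in memory and the loop still uses `bt [mem], reg`.

```c
#include <intrin.h>

unsigned char bits[32];

void repro(void)
{
    long local_num = 0x12345678;                 // local copy, constant initializer
    for (long nBit = 0; nBit < 31; nBit++)
        bits[nBit] = _bittest(&local_num, nBit); // still compiles to bt [mem], reg
}
```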
I've included some performance analysis for Intel Sandybridge-family CPUs because they're common; you didn't say which CPU you have, and yes, it matters: `bt [mem], reg` has one per 3 cycle throughput on Ryzen, one per 5 cycle throughput on Haswell. And other performance characteristics differ...
(For just looking at the asm, it's usually useful to write a function with args so you get code the compiler can't constant-propagate into. In this case it can't anyway, because it doesn't know whether anything modifies `num` before `main` runs, since it's not `static`.)
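For example (a hedged illustration of that trick, not code from the question): a standalone function with arguments shows you the general-case asm on Godbolt, because the compiler can't constant-propagate into `x` or `i`.

```c
#include <intrin.h>

unsigned char test_bit(long x, long i)
{
    return _bittest(&x, i);   // x and i are unknown at compile time, so no constant propagation
}
```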
Your instruction-counting didn't include the whole loop, so your counts are wrong, but more importantly you didn't account for the different costs of different instructions. (See Agner Fog's instruction tables and optimization guide.)
This is your whole inner loop with `_bittest`, with uop counts for Haswell / Skylake:
```c
for (nBit = 0; nBit < 31; nBit++) {
    bits[nBit] = _bittest(&num, nBit);
}
```
asm output from MSVC CL19 `-Ox` on the Godbolt compiler explorer:
```
$LL7@main:
    bt   DWORD PTR num, ebx       ; 10 uops (microcoded), one per 5 cycle throughput
    lea  rcx, QWORD PTR [rcx+1]   ; 1 uop
    setb al                       ; 1 uop
    inc  ebx                      ; 1 uop
    mov  BYTE PTR [rcx-1], al     ; 1 uop (micro-fused store-address and store-data)
    cmp  ebx, 31
    jb   SHORT $LL7@main          ; 1 uop (macro-fused with cmp)
```
That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that `bt [mem], reg` has a throughput of one per 5 clock cycles.
IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as `bt`, or a dependency it's part of causes a problem, or just being a micro-coded instruction is the problem, or maybe the outer loop is less efficient?
Anyway, using `bt [mem], reg` instead of `bt reg, reg` is the major missed optimization. This loop would have been faster than your other loop with a 1 uop, 1c latency, 2-per-clock-throughput `bt r9d, ebx`.
The inner loop compiles to just an add (as a left shift) and a sar.

Huh? Those are the instructions MSVC associates with the `curBit <<= 1;` source line (even though that line is fully implemented by the `add self,self`, and the variable-count arithmetic shift is part of a different line).
But the whole loop is this clunky mess:
```c
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++) {
    bits[nBit] = (num & curBit) >> nBit;
    curBit <<= 1;
}
```

```
$LL18@main:                        ; MSVC CL19 -Ox
    mov  ecx, ebx                  ; 1 uop
    lea  r8, QWORD PTR [r8+1]      ; 1 uop (pointer increment for bits[])
    mov  eax, r9d                  ; 1 uop (r9d holds num)
    inc  ebx                       ; 1 uop
    and  eax, edx                  ; 1 uop (edx holds curBit)
    add  edx, edx                  ; 1 uop (curBit <<= 1)
    sar  eax, cl                   ; 3 uops (variable-count shift)
    mov  BYTE PTR [r8-1], al       ; 1 uop (micro-fused store-address and store-data)
    cmp  ebx, 31
    jb   SHORT $LL18@main          ; 1 uop (macro-fused with cmp)
```
So that's 11 fused-domain uops total, and it takes 2.75 clock cycles per iteration to issue from the front-end.

I don't see any loop-carried dependency chains longer than that front-end bottleneck, so it probably runs about that fast.
Copying `ebx` to `ecx` every iteration instead of just using `ecx` as the loop counter (`nBit`) is an obvious missed optimization. The shift count is needed in `cl` for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that).
There are major missed optimizations here (in the "fast" version), so you should probably write your source differently to hand-hold your compiler into making less bad code. It implements the loop fairly literally instead of transforming it into something the CPU can do efficiently, or using `bt reg, reg` / `setc`.
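For instance, one hedged rewrite (mine, not from the question) that avoids the memory-operand `bt`: read the global into a local once so nothing forces it into memory, and express the bit test as a shift-and-mask the compiler can keep in registers.

```c
void extract_bits(long num, unsigned char bits[32])   // hypothetical helper name
{
    long n = num;                                      // one load; the copy can live in a register
    for (long nBit = 0; nBit < 31; nBit++)
        bits[nBit] = (unsigned char)((n >> nBit) & 1); // no address taken, so no forced store / bt [mem]
}
```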
How to do this fast in asm or with intrinsics
Use SSE2 / AVX. Get the right byte (the one containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit set for each element. PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use `_mm_sub_epi8(set1('0'), mask)` to subtract 0 or -1 (i.e. add 0 or 1) to ASCII `'0'`, conditionally turning it into `'1'`.
The first step of this (getting a vector of 0 / -1 from a bitmask) is covered in How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
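A hedged sketch of that approach for the low 16 bits, using SSE2 only (the helper name, the 16-bit width, and the bit-0-first output order are my choices, not from the question):

```c
#include <emmintrin.h>   // SSE2
#include <stdint.h>

/* Expand the low 16 bits of num into 16 ASCII '0'/'1' bytes, bit 0 first. */
static inline void bits_to_ascii16(uint32_t num, char out[16])
{
    /* Each byte element gets the mask byte that contains its bit:
       bytes 0..7 <- low byte of num, bytes 8..15 <- high byte. */
    __m128i lo = _mm_set1_epi8((char)(num & 0xFF));
    __m128i hi = _mm_set1_epi8((char)((num >> 8) & 0xFF));
    __m128i v  = _mm_unpacklo_epi64(lo, hi);

    /* Mask with the right bit set for each element. */
    const __m128i bitsel = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, (char)128,
                                         1, 2, 4, 8, 16, 32, 64, (char)128);

    /* PANDN: (~v) & bitsel is zero exactly where the bit was set. */
    __m128i inv  = _mm_andnot_si128(v, bitsel);

    /* PCMPEQB against zero: -1 where the bit was set, 0 where it was clear. */
    __m128i mask = _mm_cmpeq_epi8(inv, _mm_setzero_si128());

    /* '0' - (-1) = '1', '0' - 0 = '0'. */
    __m128i ascii = _mm_sub_epi8(_mm_set1_epi8('0'), mask);
    _mm_storeu_si128((__m128i *)out, ascii);
}
```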
In scalar code, this is one way to do it that runs at 1 bit -> byte per clock. There are probably ways to do better without using SSE2 (storing multiple bytes at once to get around the 1-store-per-clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.
```
    mov   eax, [num]
    lea   rdi, [rsp + xxx]   ; bits[]
.loop:
    shr   eax, 1             ; constant-count shift is efficient (1 uop).  CF = last bit shifted out
    setc  [rdi]              ; 2 uops, but just as efficient as setc reg / mov [mem], reg

    shr   eax, 1
    setc  [rdi+1]

    add   rdi, 2
    cmp   rdi, end_pointer   ; compare against another register instead of a separate counter.
    jb    .loop
```
Unrolled by two to avoid bottlenecking on the front-end, so this can run at 1 bit per clock.
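A rough C equivalent of the same scalar idea (my sketch, not part of the asm above; whether a given compiler actually emits `shr` / `setc` for it isn't guaranteed):

```c
/* Write one byte per bit (0 or 1), low bit first, unrolled by two like the asm loop.
   Assumes (end - bits) is even. */
void bits_scalar(unsigned num, unsigned char *bits, unsigned char *end)
{
    unsigned x = num;
    for (unsigned char *p = bits; p < end; p += 2) {
        p[0] = x & 1;  x >>= 1;   /* store the low bit, then a constant-count shift */
        p[1] = x & 1;  x >>= 1;
    }
}
```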