Why is it faster to print the number in binary format using arithmetic instead of _bittest

Question

Why is it faster to print the number in binary format using arithmetic instead of _bittest

The purpose of the next two sections of the code is to print the number in binary format.
The first does this with two commands (_bittest), and the second executes pure arithmetic instructions, which are three instructions.
first section of code:

#include <intrin.h> #include <stdio.h> #include <Windows.h> long num = 78002; int main() { unsigned char bits[32]; long nBit; LARGE_INTEGER a, b, f; QueryPerformanceCounter(&a); for (size_t i = 0; i < 100000000; i++) { for (nBit = 0; nBit < 31; nBit++) { bits[nBit] = _bittest(&num, nBit); } } QueryPerformanceCounter(&b); QueryPerformanceFrequency(&f); printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart); printf_s("Binary representation:\n"); while (nBit--) { if (bits[nBit]) printf_s("1"); else printf_s("0"); } return 0; }

inner loop compiles into bt and setb statements
Second section of code:

 #include <intrin.h> #include <stdio.h> #include <Windows.h> long num = 78002; int main() { unsigned char bits[32]; long nBit; LARGE_INTEGER a, b, f; QueryPerformanceCounter(&a); for (size_t i = 0; i < 100000000; i++) { long curBit = 1; for (nBit = 0; nBit < 31; nBit++) { bits[nBit] = (num&curBit) >> nBit; curBit <<= 1; } } QueryPerformanceCounter(&b); QueryPerformanceFrequency(&f); printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart); printf_s("Binary representation:\n"); while (nBit--) { if (bits[nBit]) printf_s("1"); else printf_s("0"); } return 0; }

The inner loop is compiled and added (as a left shift) and sir.
the second part of the code runs three times faster than the first.

Why are three cpu commands faster than two?

-one

c ++ performance assembly cpu-architecture

amit Nov 18 '17 at 21:01

source share

3 answers

Ped7g · Answer 1 · 2017-11-19T00:55:41+0000

Not the answer (Bo did), but the second version of the inner loop can be simplified a bit:

  long numCopy = num; for (nBit = 0; nBit < 31; nBit++) { bits[nBit] = numCopy & 1; numCopy >>= 1; }

Has a subtle difference (less than 1 instruction) with gcc 7.2 targeting 32b .

_{(I am assuming a 32b target since you are converting a long to a 32-bit array, which only makes sense for 32b targets ... and I am assuming x86 as it includes <windows.h> , so it is clear for an obsolete OS target, although I think Windows now has version 64b (I don't care.))}

Answer:

Why are three cpu commands faster than two?

Since the number of instructions only correlates with performance (usually less is better), but the modern x86 processor is a much more complex machine, translating the actual x86 instructions into microcode before execution, converting it further with things like out- (to break false chains dependencies), and then it executes the obtained microcode with various units of the CPU, capable of performing only some micro-operations, so in the ideal case you can get 2-3 micro-ops, executed in parallel 2-3 units face in one cycle, and in the worst case, you can execute a full microcode cycle that implements some complex x86 instruction with several cycles to complete, blocking most of the CPU blocks.

Another factor is the availability of data from memory and writing to memory, one cache miss, when data must be extracted from a higher level cache or even the memory itself, it creates stop signals from tens to hundreds. The presence of compact data structures that favor predictable access patterns and do not exhaust all cache lines is of utmost importance for maximizing CPU performance.

If you are at the “why 3 instructions are faster than 2 instructions” stage, you can pretty much start with any x86 optimization article / book and continue reading for several months or a year, which is a pretty complicated topic.

You can check this answer https://gamedev.stackexchange.com/q/27196 for further reading ...

Peter Cordes · Answer 2 · 2017-11-19T06:42:28+0000

I assume that you are using x86-64 MSVC CL19 (or something that does similar code).

_bittest is slower because MSVC does a terrible job and stores the value in memory, and bt [mem], reg much slower than bt reg,reg . This is a missed compiler optimization . This happens even if you use a local variable instead of num , even if the initializer remains constant!

I have included some performance analysis for Intel Sandybridge processors because they are common; you didn’t say yes, it’s important: bt [mem], reg has one throughput of 3 cycles per Ryzen, one per 5 cycles of throughput on Haswell. And other performative characteristics differ ...

(Just looking at asm, it is usually useful to create a function with args to get code that the compiler cannot execute by constant distribution. In this case, it cannot be because it does not know if all changes from num to main are triggered, because it not static .)

Your command count did not include the entire cycle, so your score is incorrect, but, more importantly, you did not take into account the different costs of different instructions . (See Agner Fog Datasheets and Optimization Guide .)

This is your entire inner loop with _bittest inner, with a uop count for Haswell / Skylake:

  for (nBit = 0; nBit < 31; nBit++) { bits[nBit] = _bittest(&num, nBit); //bits[nBit] = (bool)(num & (1UL << nBit)); // much more efficient }

ASM exit from MSVC CL19 -Ox in Godbolt compiler explorer

 $LL7@main : bt DWORD PTR num, ebx ; 10 uops (microcoded), one per 5 cycle throughput lea rcx, QWORD PTR [rcx+1] ; 1 uop setb al ; 1 uop inc ebx ; 1 uop mov BYTE PTR [rcx-1], al ; 1 uop (micro-fused store-address and store-data) cmp ebx, 31 jb SHORT $LL7@main ; 1 uop (macro-fused with cmp)

In this 15 domain domains, therefore, it can issue (at 4 per cycle) in 3.75 cycles. But this is not a bottleneck: Agner Fog testing showed that bt [mem], reg has a bandwidth of one per 5 clock cycles.

IDK, why is it 3 times slower than your other loop. Maybe other ALU instructions compete for the same port as bt , or a dependency problem caused by this problem, or just being a microcoded instruction, or maybe the outer loop is less efficient?

In any case, using bt [mem], reg instead of bt reg, reg is the main missed optimization. This loop would be faster than your other loop with 1 uop, 1c latency, 2 by 1 bt r9d, ebx .

The inner loop is compiled and added (as a left shift) and sir.

A? These are the instructions that the MSVC associates with the source line curBit <<= 1; (even if this line is fully implemented add self,self , and a variable arithmetic shift is part of another line.)

But the whole cycle is this clumsy mess:

  long curBit = 1; for (nBit = 0; nBit < 31; nBit++) { bits[nBit] = (num&curBit) >> nBit; curBit <<= 1; } $LL18@main : # MSVC CL19 -Ox mov ecx, ebx ; 1 uop lea r8, QWORD PTR [r8+1] ; 1 uop pointer-increment for bits mov eax, r9d ; 1 uop. r9d holds num inc ebx ; 1 uop and eax, edx ; 1 uop # MSVC says all the rest of these instructions are from curBit <<= 1; but they're obviously not. add edx, edx ; 1 uop sar eax, cl ; 3 uops (variable-count shifts suck) mov BYTE PTR [r8-1], al ; 1 uop (micro-fused) cmp ebx, 31 jb SHORT $LL18@main ; 1 uop (macro-fused with cmp)

So, these are 11 matching domains with smooth domains and takes 2.75 clock cycles at each iteration to exit the interface.

I do not see any chains of segments with a chain longer than this foreground bottleneck, so this probably happens quickly.

Copying ebx to ecx for each iteration instead of using ecx as a loop counter ( nBit ) is an obvious missed optimization. A shift count is needed in cl to shift the variable counter (unless you activate BMI2 instructions, if MSVC can do this).

There are big missed optimizations here (in the “fast” version) , so probably you should write your source in different ways, keep your compiler in creating less bad code. He implements this quite literally, instead of turning it into something that the processor can do efficiently, or using bt reg,reg / setc

How to do it fast in asm or using built-in functions

Use SSE2 / AVX. Get the right byte (containing the corresponding bit) into each element of the byte of the vector and PANDN (to invert the vector) using the mask that has the right bit for this element. PCMPEQB versus zero. This gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 (add 0 or 1) to ASCII '0' , conditionally turning it into '1' .

The first steps of this (getting vector 0 / -1 from the bitmask) How to invert _mm256_movemask_epi8 (VPMOVMSKB)? .

The fastest way to decompress 32 bits into a 32-byte SIMD vector (version 128b). Without SSSE3 ( pshufb ), I think punpcklbw / punpcklwd (and possibly pshufd ) is that you need to repeat each byte num 8 times and make two 16-byte vectors.
is there a reverse instruction for movemask command in intel avx2? .

In scalar code, this is one way that works with 1 bit -> bytes per cycle. There are probably ways to do better without using SSE2 (saving several bytes at once to bypass 1 storage for the clock bottleneck that exists on all current processors), but why bother? Just use SSE2.

  mov eax, [num] lea rdi, [rsp + xxx] ; bits[] .loop: shr eax, 1 ; constant-count shift is efficient (1 uop). CF = last bit shifted out setc [rdi] ; 2 uops, but just as efficient as setc reg / mov [mem], reg shr eax, 1 setc [rdi+1] add rdi, 2 cmp end_pointer ; compare against another register instead of a separate counter. jb .loop

Deployed in two to avoid bottlenecks on the interface, so this can work 1 bit per clock cycle.

Bo persson · Answer 3 · 2017-11-19T00:38:28+0000

The difference is that in the code _bittest(&num, nBit); a num pointer is used, which forces it to store it in memory. And access to memory makes the code much slower.

  bits[nBit] = _bittest(&num, nBit); 00007FF6D25110A0 bt dword ptr [num (07FF6D2513034h)],ebx ; <----- 00007FF6D25110A7 lea rcx,[rcx+1] 00007FF6D25110AB setb al 00007FF6D25110AE inc ebx 00007FF6D25110B0 mov byte ptr [rcx-1],al

Another version stores all variables in registers and uses very fast register shifts and adds. No access to memory.

Why is it faster to print the number in binary format using arithmetic instead of _bittest

How to do it fast in asm or using built-in functions

More articles: