What does `static_cast <volatile void>` mean for the optimizer?
When people benchmark code from different libraries, I sometimes see code like this:
```cpp
auto std_start = std::chrono::steady_clock::now();
for (int i = 0; i < 10000; ++i)
    for (int j = 0; j < 10000; ++j)
        volatile const auto __attribute__((unused)) c = std_set.count(i + j);
auto std_stop = std::chrono::steady_clock::now();
```

`volatile` is used here so that the optimizer does not notice that the result of the test code is discarded and then throw away the whole computation.
If the code under test does not return a value, say it is `void do_something(int)`, then I sometimes see code like this:
```cpp
auto std_start = std::chrono::steady_clock::now();
for (int i = 0; i < 10000; ++i)
    for (int j = 0; j < 10000; ++j)
        static_cast<volatile void>(do_something(i + j));
auto std_stop = std::chrono::steady_clock::now();
```

Is this a correct use of `volatile`? What is `volatile void`? What does it mean in terms of the compiler and the standard?
The standard (N4296) in [dcl.type.cv] says:
7 [ Note: volatile is a hint to the implementation to avoid aggressive optimization involving the object because the value of the object might be changed by means undetectable by an implementation. Furthermore, for some implementations, volatile might indicate that special hardware instructions are required to access the object. See 1.9 for detailed semantics. In general, the semantics of volatile are intended to be the same in C++ as they are in C. - end note ]
Section 1.9 gives many details of the execution model, but everything it says about volatile refers to "access to volatile objects". It is not clear to me whether evaluating an expression that has been cast to `volatile void` counts as such an access, so I am not sure whether I understand this code correctly and whether any optimization barrier is actually created.
`static_cast<volatile void>(foo())` does not work as a way to force the compiler to actually evaluate `foo()` in any of gcc / clang / MSVC / ICC with optimization enabled.
```cpp
#include <bitset>

void foo() {
    for (int i = 0; i < 10000; ++i)
        for (int j = 0; j < 10000; ++j) {
            std::bitset<64> std_set(i + j);
            //volatile const auto c = std_set.count();   // real work happens
            static_cast<volatile void>(std_set.count()); // optimizes away
        }
}
```

compiles to just a `ret` with all 4 major x86 compilers. (MSVC emits asm for stand-alone `std::bitset::count()` definitions or something like that, but scroll down past that to its trivial definition of `foo()`.)
(Source + asm output for this and the following examples on Matt Godbolt's Compiler Explorer.)
Maybe there are some compilers where `static_cast<volatile void>()` does something, in which case it might be a convenient way to write a benchmark loop that computes a result without emitting instructions to store it to memory. (Sometimes that is exactly what you want in a microbenchmark.)
Accumulating the result with `tmp += foo()` (or `tmp |=`) and returning it from `main()` or printing it with `printf` can also be useful, instead of storing it to a `volatile` variable. Or there are various compiler-specific tricks, such as using an empty inline-asm statement to defeat the optimizer without actually adding any instructions.
See Chandler Carruth's CppCon2015 talk on using perf to explore compiler optimizations for a discussion of doing this with GNU C. But his `escape()` function is written so that the value must exist in memory (it passes a `void*` to an asm statement with a `"memory"` clobber). We don't need that; we only need the compiler to have the value in a register or memory, or even folded into an immediate constant. (It still can't optimize the loop away, because it doesn't know that the asm statement is zero instructions.)
This compiles to just a `popcnt` with no extra stores, with gcc:
```cpp
// Just force the value to be in memory, a register, or even an immediate.
// Instead of an empty inline-asm template, put the operand in an asm comment
// so we can see what the compiler chose; the comment itself has absolutely
// no effect on optimization.
static void escape_integer(int a) {
    asm volatile("# value = %0" : : "g"(a));
}

// simplified with just one inner loop
void test1() {
    for (int i = 0; i < 10000; ++i) {
        std::bitset<64> std_set(i);
        int count = std_set.count();
        escape_integer(count);
    }
}
```

gcc8.0 20171110 nightly `-O3 -march=nehalem` (for the `popcnt` instruction):

```asm
test1():
        # value = 0      # it peels the first iteration, with an immediate 0 for the inline asm
        mov     eax, 1
.L4:
        popcnt  rdx, rax
        # value = edx    # the inline-asm comment has %0 filled in, showing where gcc put the value
        add     rax, 1
        cmp     rax, 10000
        jne     .L4
        ret
```

Clang decides to put the value in memory to satisfy the `"g"` constraint, which is pretty dumb. But clang tends to do this when you give it an inline-asm constraint that includes memory as an option. So for clang this is no better than Chandler's `escape()` function.
```asm
# clang5.0 -O3 -march=nehalem
test1():
        xor     eax, eax
                                #DEBUG_VALUE: i <- 0
.LBB1_1:                        # =>This Inner Loop Header: Depth=1
        popcnt  rcx, rax
        mov     dword ptr [rsp - 4], ecx
        # value = -4(%rsp)      # the inline asm gets the value in memory
        inc     rax
        cmp     rax, 10000
        jne     .LBB1_1
        ret
```

ICC18 with `-march=haswell` does this:
```asm
test1():
        xor     eax, eax           #30.16
..B2.2:                            # Preds ..B2.2 ..B2.1
        # optimization report
        # %s was not vectorized: ASM code cannot be vectorized
        xor     rdx, rdx           # breaks popcnt's false dependency on the destination
        popcnt  rdx, rax           #475.16
        inc     rax                #30.34
        # value = edx
        cmp     rax, 10000         #30.25
        jl      ..B2.2             # Prob 99%   #30.25
        ret                        #35.1
```

Oddly, ICC used `xor rdx,rdx` instead of `xor eax,eax`. That wastes a REX prefix, and it is not recognized as a dependency-breaking idiom on Silvermont / KNL.