How does this square root approximation work?

I found a rather strange but working square root approximation for floats that I really do not understand. Can someone explain to me why this code works?

 float sqrt(float f)
 {
     const int result = 0x1fbb4000 + (*(int*)&f >> 1);
     return *(float*)&result;
 }

I tested it a bit and it deviates from std::sqrt() values by about 1-3%. I know of the Quake III fast inverse square root, and I think this is something similar (without the Newton iteration), but I would really appreciate an explanation of how it works.

(Note: I have tagged this question both C and C++, since it is (see comments) both valid C and valid C++ code.)

+50
c++ optimization c floating-point ieee-754
Mar 30 '17 at 13:57
4 answers

(*(int*)&f >> 1) right-shifts the bitwise representation of f. This almost halves the exponent, which is roughly equivalent to taking the square root.¹

Why almost? In IEEE-754, the actual exponent is e - 127.² To halve this we would want e/2 - 64, but the approximation above gives only e/2 - 127. So we need to add 63 to the resulting exponent. This is contributed by bits 30-23 of the magic constant (0x1fbb4000).

I'd guess that the remaining bits of the magic constant were chosen to minimize the maximum error across the mantissa range, or something like that. However, it's unclear whether it was determined analytically, iteratively, or heuristically.




It is worth noting that this approach is somewhat non-portable. It makes (at least) the following assumptions:

  • The platform uses single-precision IEEE-754 for float .
  • The endianness of the float representation.
  • That you will be unaffected by undefined behavior due to the fact that this approach violates C/C++'s strict-aliasing rules.

Therefore, this should be avoided unless you are certain that it gives predictable behavior on your platform (and indeed, that it provides a useful speedup over sqrtf()!).




1. sqrt(a^b) = (a^b)^0.5 = a^(b/2)

2. See https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Exponent_encoding

+70
Mar 30 '17 at 14:14

See Oliver Charlesworth's explanation of why this almost works. I'm addressing an issue raised in the comments.

Since several people have pointed out the non-portability, here are a few ways to make it more portable, or at least to make the compiler tell you whether it will work.

Firstly, C++ allows you to check std::numeric_limits<float>::is_iec559 at compile time, for example in a static_assert . You can also check that sizeof(int) == sizeof(float) , which will not be true if int is 64-bit, but what you really want to do is use uint32_t , which, if it exists, will always be exactly 32 bits wide, will have well-defined behavior with shifts and overflow, and will cause a compilation error if your weird architecture has no such integral type. Either way, you should also static_assert() that the types have the same size. Static assertions cost nothing at run time, and you should always check your preconditions this way when possible.

Unfortunately, the test of whether converting the bits of a float to a uint32_t and shifting is big-endian, little-endian, or neither cannot be computed as a compile-time constant expression. Here, I put the run-time check in the part of the code that depends on it, but you could put it in the initialization and do it once. In practice, both gcc and clang can optimize this test away at compile time.

You do not want to use the unsafe pointer cast, and there are some systems I've worked on in the real world where that could crash the program with a bus error. The maximally-portable way to convert object representations is memcpy() . In my example below, I instead type-pun with a union , which works on any implementation that actually exists. (Language lawyers object to it, but no successful compiler will ever silently break that much code.) If you must do a pointer conversion (see below), there is alignas() . But however you do the conversion, the result will be implementation-defined, which is why we check the result of converting a test value.

Anyway, not that you would be likely to use this on a modern CPU, but here is a C++14 version that checks those non-portable assumptions:

 #include <cassert>
 #include <cmath>
 #include <cstdint>
 #include <cstdlib>
 #include <iomanip>
 #include <iostream>
 #include <limits>
 #include <vector>
 
 using std::cout;
 using std::endl;
 using std::size_t;
 using std::sqrt;
 using std::uint32_t;
 
 template <typename T, typename U>
 inline T reinterpret(const U x)
 /* Reinterprets the bits of x as a T.  Cannot be constexpr
  * in C++14 because it reads an inactive union member.
  */
 {
     static_assert( sizeof(T)==sizeof(U), "" );
     union tu_pun {
         U u = U();
         T t;
     };
     const tu_pun pun{x};
     return pun.t;
 }
 
 constexpr float source = -0.1F;
 constexpr uint32_t target = 0x5ee66666UL;
 
 const uint32_t after_rshift = reinterpret<uint32_t,float>(source) >> 1U;
 const bool is_little_endian = after_rshift == target;
 
 float est_sqrt(const float x)
 /* A fast approximation of sqrt(x) that works less well for subnormal numbers.
  */
 {
     static_assert( std::numeric_limits<float>::is_iec559, "" );
     assert(is_little_endian); // Could provide alternative big-endian code.
 
     /* The algorithm relies on the bit representation of normal IEEE floats, so
      * a subnormal number as input might be considered a domain error as well?
      */
     if ( std::isless(x, 0.0F) || !std::isfinite(x) )
         return std::numeric_limits<float>::signaling_NaN();
 
     constexpr uint32_t magic_number = 0x1fbb4000UL;
     const uint32_t raw_bits = reinterpret<uint32_t,float>(x);
     const uint32_t rejiggered_bits = (raw_bits >> 1U) + magic_number;
     return reinterpret<float,uint32_t>(rejiggered_bits);
 }
 
 int main(void)
 {
     static const std::vector<float> test_values{
         4.0F, 0.01F, 0.0F, 5e20F, 5e-20F, 1.262738e-38F };
 
     for ( const float& x : test_values ) {
         const double gold_standard = sqrt((double)x);
         const double estimate = est_sqrt(x);
         const double error = estimate - gold_standard;
         cout << "The error for (" << estimate << " - " << gold_standard
              << ") is " << error;
 
         if ( gold_standard != 0.0 && std::isfinite(gold_standard) ) {
             const double error_pct = error/gold_standard * 100.0;
             cout << " (" << error_pct << "%).";
         } else
             cout << '.';
 
         cout << endl;
     }
 
     return EXIT_SUCCESS;
 }

Update

The following is an alternative definition of reinterpret<T,U>() that avoids type-punning. You could also implement the type-pun in modern C, where it is allowed by the standard, and call the function as extern "C" . I think the type-pun is more elegant, type-safe and consistent with the quasi-functional style of this program than memcpy() is. I also don't think you gain much, because you could still have undefined behavior from a hypothetical trap representation. Additionally, clang++ 3.9.1 with -O -S is able to statically analyze the type-punned version, optimize the variable is_little_endian to the constant 0x1 and eliminate the run-time test, but it can only optimize this version down to a single instruction.

More importantly, though, this code is not guaranteed to work on every compiler. For example, some older computers cannot even address exactly 32 bits of memory. But in those cases, it should fail to compile and tell you why. No compiler is going to suddenly break a huge amount of legacy code; although the standard technically gives permission for that while still claiming conformance to C++14, it will only happen on an architecture very different from what we expect. And if our assumptions are so invalid that some compiler is going to turn a type-pun between a float and a 32-bit unsigned integer into a dangerous bug, I seriously doubt the logic of this code would hold up if we just used memcpy() instead. We want this code to fail at compile time, and to tell us why.

 #include <cassert>
 #include <cstdint>
 #include <cstring>
 
 using std::memcpy;
 using std::uint32_t;
 
 template <typename T, typename U>
 inline T reinterpret(const U &x)
 /* Reinterprets the bits of x as a T.  Cannot be constexpr
  * in C++14 because it modifies a variable.
  */
 {
     static_assert( sizeof(T)==sizeof(U), "" );
     T temp;
     memcpy( &temp, &x, sizeof(T) );
     return temp;
 }
 
 constexpr float source = -0.1F;
 constexpr uint32_t target = 0x5ee66666UL;
 
 const uint32_t after_rshift = reinterpret<uint32_t,float>(source) >> 1U;
 extern const bool is_little_endian = after_rshift == target;

However, Stroustrup et al., in the C++ Core Guidelines, recommend a reinterpret_cast instead:

 #include <cassert>
 
 template <typename T, typename U>
 inline T reinterpret(const U x)
 /* Reinterprets the bits of x as a T.  Cannot be constexpr
  * in C++14 because it uses reinterpret_cast.
  */
 {
     static_assert( sizeof(T)==sizeof(U), "" );
     const U temp alignas(T) alignas(U) = x;
     return *reinterpret_cast<const T*>(&temp);
 }

The compilers I tested can also optimize this to a folded constant. Stroustrup's reasoning is [sic]:

Accessing the result of a reinterpret_cast as a different type from the object's declared type is still undefined behavior, but at least we can see that something tricky is going on.

+13
Mar 30 '17 at 23:28

Let y = sqrt(x);

it follows from the properties of logarithms that log(y) = 0.5 * log(x) (1)

Interpreting a normal float as an integer gives INT(x) = Ix = L * (log(x) + B - σ) (2)

where L = 2^N, N is the number of bits of the significand, B is the exponent bias, and σ is a free parameter for tuning the approximation.

Combining (1) and (2) gives: Iy = 0.5 * (Ix + (L * (B - σ)))

This is written in the code as (*(int*)&x >> 1) + 0x1fbb4000;

Find σ such that the constant equals 0x1fbb4000, and determine whether it is optimal.

+8
Mar 30 '17 at 15:51

Adding a test harness to test all float values:

The approximation is within 4% for many float values, but very poor for subnormal numbers. YMMV.

 Worst:1.401298e-45 211749.20%
 Average:0.63%
 Worst:1.262738e-38 3.52%
 Average:0.02%

Note that with a +/-0.0 argument, the result is not zero.

 printf("% e % e\n", sqrtf(+0.0), sqrt_apx(0.0));   // 0.000000e+00 7.930346e-20
 printf("% e % e\n", sqrtf(-0.0), sqrt_apx(-0.0));  // -0.000000e+00 -2.698557e+19



Test code

 #include <float.h>
 #include <limits.h>
 #include <math.h>
 #include <stddef.h>
 #include <stdio.h>
 #include <stdint.h>
 #include <stdlib.h>
 
 float sqrt_apx(float f) {
     const int result = 0x1fbb4000 + (*(int*) &f >> 1);
     return *(float*) &result;
 }
 
 double error_value = 0.0;
 double error_worst = 0.0;
 double error_sum = 0.0;
 unsigned long error_count = 0;
 
 void sqrt_test(float f) {
     if (f == 0) return;
     volatile float y0 = sqrtf(f);
     volatile float y1 = sqrt_apx(f);
     double error = (1.0 * y1 - y0) / y0;
     error = fabs(error);
     if (error > error_worst) {
         error_worst = error;
         error_value = f;
     }
     error_sum += error;
     error_count++;
 }
 
 void sqrt_tests(float f0, float f1) {
     error_value = error_worst = error_sum = 0.0;
     error_count = 0;
     for (;;) {
         sqrt_test(f0);
         if (f0 == f1) break;
         f0 = nextafterf(f0, f1);
     }
     printf("Worst:%e %.2f%%\n", error_value, error_worst*100.0);
     printf("Average:%.2f%%\n", error_sum / error_count);
     fflush(stdout);
 }
 
 int main() {
     sqrt_tests(FLT_TRUE_MIN, FLT_MIN);
     sqrt_tests(FLT_MIN, FLT_MAX);
     return 0;
 }
+6
Mar 30 '17 at 15:59


