Is there a more direct method to convert a float to an int with rounding than adding 0.5f and using a truncating conversion?

Converting from float to int with rounding happens quite often in C++ code that works with floating-point data. One common case, for example, is building conversion tables.

Consider this piece of code:

    // Convert a positive float value and round to the nearest integer
    int RoundedIntValue = (int)(FloatValue + 0.5f);

C and C++ define the (int) cast as truncating toward zero, so you need to add 0.5f to ensure rounding to the nearest integer (when the input is positive). For the above, the VS2015 compiler generates the following code:

    movss xmm9, DWORD PTR __real@3f000000 // 0.5f
    addss xmm0, xmm9
    cvttss2si eax, xmm0

The above works, but it could be more efficient...

Intel's designers apparently thought enough of the problem to solve it with a single instruction that does exactly what is needed: convert to the nearest integer value: cvtss2si (note the single 't' in the mnemonic).

If cvtss2si were to replace the cvttss2si instruction in the sequence above, two of the three instructions would simply be eliminated (as would the use of an extra xmm register, which could allow better optimization overall).

So how can we write the C++ statement(s) to get this simple job done with a single cvtss2si instruction?

I have been taking stabs at this, trying things like the following, but even with the optimizer on the job it does not boil down to the one machine instruction that could/should do the job:

 int RoundedIntValue = _mm_cvt_ss2si(_mm_set_ss(FloatValue)); 

Unfortunately, the above seems intent on clearing out an entire vector register that will never be used, instead of simply using the single 32-bit value:

    movaps xmm1, xmm0
    xorps xmm2, xmm2
    movss xmm2, xmm1
    cvtss2si eax, xmm2

Perhaps I am missing an obvious approach here.

Can you suggest a set of C++ statements that will ultimately generate a single cvtss2si instruction?

2 answers

Visual Studio 15.6, released today, finally fixes this issue. We now see a single instruction used when this function is inlined:

    inline int ConvertFloatToRoundedInt(float FloatValue)
    {
        return _mm_cvt_ss2si(_mm_set_ss(FloatValue)); // Convert to integer with rounding
    }

I am impressed that Microsoft finally got around to it.


This is an optimizer flaw in Microsoft's compiler; by contrast, modern versions of GCC, Clang, and ICC produce the expected code. For a function like:

    int RoundToNearestEven(float value)
    {
        return _mm_cvt_ss2si(_mm_set_ss(value));
    }

all compilers but Microsoft's emit the following object code:

    cvtss2si eax, xmm0
    ret

while the Microsoft compiler (as of VS 2015 Update 3) emits the following:

    movaps xmm1, xmm0
    xorps xmm2, xmm2
    movss xmm2, xmm1
    cvtss2si eax, xmm2
    ret

The same is true of the double-precision version, cvtsd2si (i.e., the _mm_cvtsd_si32 intrinsic).

Until the optimizer is improved, there is no faster alternative. Fortunately, the generated code is not nearly as slow as it might seem. Register moves and clears are among the fastest possible instructions, and some of them can probably be handled entirely in the front end during register renaming. And it is certainly faster than any of the possible alternatives, often by an order of magnitude:

  • The add-0.5 trick that you mentioned is not only slower, because it has to load a constant and perform an addition, it also fails to produce a correctly rounded result in all cases.

  • Using _mm_load_ss to load the floating-point value into an __m128 suitable for use with the _mm_cvt_ss2si intrinsic is a pessimization, because it causes a round-trip through memory instead of a simple register-to-register move.

    (Note that while _mm_set_ss is always better for x86-64, where the calling convention passes floating-point values in SSE registers, I have occasionally observed _mm_load_ss producing more optimal code in x86-32 builds than _mm_set_ss, but this depends heavily on many factors and has been observed only when multiple intrinsics were used in a complicated code sequence. Your default choice should be _mm_set_ss.)

  • Substituting a reinterpret_cast<__m128&>(value) (or a moral equivalent) for the _mm_set_ss intrinsic is both unsafe and inefficient. It causes a spill from the SSE register into memory; the cvtss2si instruction then uses that memory location as its source operand.

  • Declaring a temporary __m128 structure and initializing it with the value is safe, but even more inefficient. Space is allocated on the stack for the entire structure, and each slot is filled with either 0 or the floating-point value. That structure's memory location is then used as the source operand for cvtss2si.

  • The lrint family of functions provided by the standard C library should do what you want, and in fact compiles to straightforward cvt* instructions on certain other compilers, but is extremely sub-optimal on Microsoft's compiler. The functions are never inlined, so you always pay the cost of a function call. Worse, the code inside the function is not optimal. Both of these issues have been reported as bugs, but we are still waiting for a fix. Similar problems exist with the other conversion functions provided by the standard library, including lround and friends.

  • The x87 FPU offers the FIST/FISTP instructions, which perform a similar task, but the C and C++ standards require that a cast truncate, not round-to-nearest-even (the FPU's default), so the compiler has to insert a bunch of code to change the current rounding mode, perform the conversion, and then change it back. This is extremely slow, and there is no way to instruct the compiler to skip it other than using inline assembly. Aside from the fact that inline assembly is unavailable in the 64-bit compiler, MSVC's inline-assembly syntax also offers no way to specify inputs and outputs, so you pay double load-and-store penalties in both directions. And even if that weren't the case, you would still have to pay the cost of copying the floating-point value from the SSE register into memory and then onto the x87 FPU stack.

Intrinsics are great and can often allow you to produce code that is faster than what the compiler would otherwise generate, but they are not perfect. If you are like me and frequently analyze the disassembly of your binaries, you will often be disappointed. Nevertheless, your best bet here is to use the intrinsic.

As for why the optimizer emits the code the way it does, I can only speculate, since I do not work on the Microsoft compiler team, but my guess is that some of the other cvt* instructions have false dependencies that the code generator needs to work around. For example, cvtss2sd does not modify the upper 64 bits of the destination XMM register. Such partial register updates cause stalls and reduce the opportunity for instruction-level parallelism. This is especially a problem in loops, where the upper bits of the register form a second loop-carried dependency chain, even though we do not actually care about their contents. Because execution of the cvtss2sd instruction cannot begin until the preceding instruction completes, latency is vastly increased. However, executing an xorps or movss instruction first clears the upper bits of the register, breaking the dependency and avoiding the possibility of a stall. This is an interesting example of a case where shorter code is not equivalent to faster code. The compiler team started inserting these dependency-breaking instructions for scalar conversions in the compiler shipped with VS 2010, and probably over-applied the heuristic.


Source: https://habr.com/ru/post/1262388/

