Is NEON of ARM faster for integers than floats?

Or do floating point and integer operators have the same speed? And if not, how much faster will the integer version be?

+4
3 answers

You can find cycle timing information under Programming for Advanced SIMD instructions for the Cortex-A8 (ARM has not published such tables for newer cores; instruction timing has become quite complicated since then).

See the cycle timings for Advanced SIMD integer ALU instructions compared with Advanced SIMD floating-point instructions:

You may also want to read the explanation of how to read these tables.

To give a concrete answer: in general, floating-point instructions take two cycles, while integer ALU instructions take one cycle. On the other hand, a long long (8-byte integer) multiply takes four cycles (per a forum post from the same source), while a double multiply takes two cycles.

Overall, it seems you should not worry so much about float versus integer; careful selection of the data width (float vs double, int vs long long) matters more.

+6

It depends on which core you have, but the trend was for integer operations to have fuller access to the 128-bit data paths. This no longer holds on newer processors.

Of course, integer arithmetic also gives you the opportunity to increase parallelism by using 16-bit or 8-bit operations.

As with all integer-versus-floating-point arguments, it depends on the particular problem and how much time you are willing to invest in tuning, because the two can rarely execute exactly the same code.

+4

I would point to auselen's answer for its excellent links, but I found the actual cycle counting there a bit misleading. It is true that it can "go either way" depending on the precision you need, but let's say you have parallelism in your algorithm and can work with two words (SP floats) at a time. Suppose you need enough precision that floating point might be a good idea... 24 bits.

In particular, when analyzing NEON performance, remember that there is a write-back delay (pipeline latency), so you have to wait for a result to become ready if it is needed as input to another instruction.

For fixed point, you will need 32-bit ints to represent at least 24 bits of precision:

  • Multiply two 32-bit numbers together, producing a 64-bit result. This takes two cycles and requires an extra register to hold the wide result.
  • Shift the 64-bit result back down to a 32-bit number of the desired precision. This takes one cycle, and you have to wait for the write-back latency (5-6 cycles) of the multiply.

For floating point:

  • Multiply two 32-bit floats together. This takes one cycle.

So for this scenario, there is no way you would ever pick fixed-point integer over floating point.

If you are dealing with 16-bit data, the trade-offs are much closer, although you may need additional instructions to shift the multiplication result to the required precision. If you are working in Q15, you can use the VQDMULH instruction on s16 data and get much higher throughput than SP float, using fewer registers.

In addition, as auselen notes, newer cores have different micro-architectures, and things keep changing. We are lucky that ARM actually publishes this information. For vendors that modify the micro-architecture, such as Apple, Qualcomm and Samsung (maybe others...), the only way to find out is to try it, which can be a lot of work if you are writing assembly. Still, the official ARM instruction timings are probably quite useful, and I believe they also publish numbers for the A9, which are basically identical.

+2

Source: https://habr.com/ru/post/1483765/
