I would defer to auselen's answer for its excellent links to all the references, but I found the actual cycle counts a bit misleading. It's true that it can "go either way" depending on the precision you need, but let's say that you have parallelism in your program and can work efficiently on two words (SP float) at a time. Let's also say that you need the precision for which floating point might be a good idea: 24 bits.
In particular, when analyzing NEON performance, remember that there is a write-back delay (pipeline delay), so if a result is needed as the input to another instruction, you have to wait until that result is ready.
For fixed point, you will need 32-bit ints to represent at least 24 bits of precision:
- Multiply two-by-two 32-bit numbers together and get a 64-bit result. This takes two cycles and requires an extra register to hold the wide result.
- Shift the 64-bit numbers back to 32-bit numbers of the desired precision. This takes one cycle, and you have to wait for the write-back delay (5-6 cycles) from the multiply.
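The two fixed-point steps above can be sketched in plain C as a scalar model of what happens in each of the two lanes. This is only an illustration: the Q8.24 format, the `FRAC_BITS` constant and the `q24_mul2` name are assumptions of mine, and the NEON instructions named in the comments are one plausible mapping rather than something taken from the timing tables.

```c
#include <stdint.h>

/* Assumed format for this sketch: Q8.24 (24 fractional bits). */
#define FRAC_BITS 24

/* Hypothetical helper: multiply two lanes of Q8.24 values.
 * Scalar model only; on NEON this would roughly correspond to a
 * widening multiply (e.g. VMULL.S32) followed by a narrowing
 * shift (e.g. VSHRN.I64). */
void q24_mul2(const int32_t a[2], const int32_t b[2], int32_t out[2])
{
    for (int i = 0; i < 2; ++i) {
        /* Step 1: 32 x 32 -> 64-bit product (needs a wider register). */
        int64_t wide = (int64_t)a[i] * (int64_t)b[i];

        /* Step 2: shift back to the working precision and narrow.
         * Assumes an arithmetic right shift for negative values, which
         * is what common compilers do; results that overflow 32 bits
         * are not handled here. */
        out[i] = (int32_t)(wide >> FRAC_BITS);
    }
}
```

The second step depends on the result of the first, which is where the write-back delay mentioned above comes into play.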
For floating point:
- Multiply two-by-two 32-bit floats together. This takes one cycle.
So, for this scenario, there is no way you would ever choose integer over floating point.
If you are dealing with 16-bit data, the tradeoffs are much closer, although you may still need an extra instruction to shift the result of the multiply back to the desired precision. If you are using Q15, you can use the VQDMULH instruction on s16 data and achieve much higher performance, with fewer registers, than SP float.
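To make the Q15 case concrete, here is a minimal scalar sketch of what VQDMULH does to a single s16 lane: double the product, keep the high half, and saturate. The function name `q15_qdmulh` is hypothetical; with intrinsics the vector form would be something like `vqdmulh_s16`.

```c
#include <stdint.h>

/* Hypothetical scalar model of VQDMULH.S16 for one lane:
 * saturating doubling multiply, returning the high half. */
int16_t q15_qdmulh(int16_t a, int16_t b)
{
    /* The only input pair that overflows is (-1.0) * (-1.0),
     * which saturates to the largest Q15 value. */
    if (a == INT16_MIN && b == INT16_MIN)
        return INT16_MAX;

    int32_t prod = (int32_t)a * (int32_t)b;  /* Q30 product */

    /* Doubling gives Q31; the top 16 bits are the Q15 result.
     * Assumes an arithmetic right shift for negative values. */
    return (int16_t)((2 * prod) >> 16);
}
```

Because the shift back to Q15 is folded into the instruction, no separate narrowing step or wide temporary register is needed, which is why this case compares so well against SP float.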
Also, as auselen notes, newer cores have different microarchitectures, and things always change. We are lucky that ARM actually makes this information public. For vendors that modify the microarchitecture, such as Apple, Qualcomm and Samsung (and probably others), the only way to know is to try it, which can be a lot of work if you are writing assembly. Still, I think ARM's official instruction timing information is probably quite useful. And I do think they publish the numbers for the A9, and they are mostly identical.