SIMD vectorize atan2 using ARM NEON assembly

I want to calculate the magnitude and angle of 4 points using NEON SIMD instructions and assembly. Most languages (C++ in my case) have a library function that calculates the angle (atan2), but only for a single pair of floating-point values (x and y). I would like to use SIMD instructions that operate on q registers to calculate atan2 for a vector of 4 values.

The accuracy does not need to be high; speed is more important.


I already have some assembly instructions that calculate the magnitude for 4 floating-point lanes, with acceptable accuracy for my application. q1 contains 4 "x" values (x1, x2, x3, x4). q2 contains 4 "y" values (y1, y2, y3, y4). q7 receives the 4 results, computed from the sums (x1^2 + y1^2, x2^2 + y2^2, x3^2 + y3^2, x4^2 + y4^2).

    vmul.f32    q7, q1, q1
    vmla.f32    q7, q2, q2
    vrecpe.f32  q7, q7
    vrsqrte.f32 q7, q7
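For readers less familiar with NEON, here is a minimal scalar sketch of what the four instructions compute in each lane. It assumes exact division and square root in place of the hardware estimates (vrecpe and vrsqrte only deliver roughly 8 bits of precision), and the function name magnitude_lane is made up for illustration.

```cpp
#include <cmath>

// One lane of the instruction sequence, modelled with exact arithmetic.
// The real vrecpe/vrsqrte instructions return low-precision estimates.
inline float magnitude_lane(float x, float y) {
    float s = x * x;               // vmul.f32    q7, q1, q1  -> x^2
    s = s + y * y;                 // vmla.f32    q7, q2, q2  -> x^2 + y^2
    float r = 1.0f / s;            // vrecpe.f32  q7, q7      -> ~1 / (x^2 + y^2)
    return 1.0f / std::sqrt(r);    // vrsqrte.f32 q7, q7      -> ~sqrt(x^2 + y^2)
}
```

The reciprocal followed by the reciprocal square root is what turns the sum of squares into a magnitude: 1/sqrt(1/s) equals sqrt(s).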

What is the fastest way to calculate the approximate atan2 for two vectors using SIMD instructions?

1 answer

See math-neon for an existing single-precision float implementation. Since it has no (or few) conditionals, it should translate well to a SIMD implementation.

ARM NEON has no instruction that computes this directly, so you need an approximation, and there are various methods that do better than a Taylor series. In particular, the minimax approach gives a good polynomial candidate. Minimax refers to minimizing the maximum error; a Chebyshev approximation is usually very good in this respect.
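As a rough illustration of the Chebyshev route, the sketch below fits atan on [-1, 1] by sampling it at Chebyshev nodes and then measures the worst-case error on a grid. This is a near-minimax fit, not an exact minimax (Remez) polynomial, and the names chebyshev_fit, chebyshev_eval and max_abs_error are invented for this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Coefficients c[0..degree] such that f(x) ~= sum_j c[j] * T_j(x) on [-1, 1],
// obtained from samples of f at `nodes` Chebyshev points.
template <typename F>
std::vector<double> chebyshev_fit(F f, std::size_t degree, std::size_t nodes = 64) {
    const double pi = 3.14159265358979323846;
    std::vector<double> c(degree + 1, 0.0);
    for (std::size_t j = 0; j <= degree; ++j) {
        double sum = 0.0;
        for (std::size_t k = 0; k < nodes; ++k) {
            const double theta = pi * (k + 0.5) / nodes;
            sum += f(std::cos(theta)) * std::cos(j * theta);
        }
        c[j] = 2.0 * sum / nodes;
    }
    c[0] *= 0.5;  // the constant term carries a factor 1/2 in the series
    return c;
}

// Clenshaw recurrence: evaluates sum_j c[j] * T_j(x); c must be non-empty.
inline double chebyshev_eval(const std::vector<double>& c, double x) {
    double b1 = 0.0, b2 = 0.0;
    for (std::size_t j = c.size() - 1; j >= 1; --j) {
        const double b0 = 2.0 * x * b1 - b2 + c[j];
        b2 = b1;
        b1 = b0;
    }
    return x * b1 - b2 + c[0];
}

// Largest |approximation - atan| over an evenly spaced grid on [-1, 1].
inline double max_abs_error(const std::vector<double>& c, int samples = 2001) {
    double worst = 0.0;
    for (int i = 0; i < samples; ++i) {
        const double x = -1.0 + 2.0 * i / (samples - 1);
        worst = std::max(worst, std::fabs(chebyshev_eval(c, x) - std::atan(x)));
    }
    return worst;
}
```

Calling chebyshev_fit([](double x) { return std::atan(x); }, 9) and passing the result to max_abs_error should report an error on the order of 1e-5. Raising or lowering the degree is how you trade accuracy for speed.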

DSP Guru has material on different methods of approximating functions. There are also numerous books online. You can search for optimal polynomials using MATLAB, Octave, or some other toolset. Generally, you need to tie this to the input range and the accuracy you need. If you have a good algorithm for a single value, extending it to SIMD of any type should be trivial.

The question "calculating atan2" has a link to the Apple atan.c source. The coefficients in that code are most likely derived by the methods I gave above. The problem with that code is that it does not scale to SIMD: its atan() approximation is piecewise, so it needs different coefficients depending on the range. For SIMD, you want the same coefficients (multiplies, divides, equations) across the whole range.
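To make the "same coefficients for every lane" point concrete, here is a minimal branch-free sketch in portable C++. It is not the math-neon or Apple code. It swaps in the well-known low-accuracy approximation atan(a) ~ a*(pi/4 + 0.273*(1 - a)) for a in [0, 1] (maximum error about 0.0038 rad), and a std::array of 4 floats stands in for a q register. All names (Vec4, atan_unit, atan2_approx4, atan2_approx_array) are hypothetical. Each ternary is the kind of operation that maps to a compare plus a bitwise select (vcgt/vbsl) on NEON.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Vec4 = std::array<float, 4>;  // stands in for one NEON q register

// atan(a) for a in [0, 1]; one formula, identical in every lane.
inline float atan_unit(float a) {
    const float kQuarterPi = 0.78539816f;
    return a * (kQuarterPi + 0.273f * (1.0f - a));
}

// Approximate atan2(y, x) for 4 lanes without data-dependent branches.
inline Vec4 atan2_approx4(const Vec4& y, const Vec4& x) {
    const float kPi = 3.14159265f;
    const float kHalfPi = 1.57079633f;
    Vec4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        const float ax = std::fabs(x[i]);
        const float ay = std::fabs(y[i]);
        const float hi = ax > ay ? ax : ay;              // lane-wise max
        const float lo = ax > ay ? ay : ax;              // lane-wise min
        // Range reduction to [0, 1]. ARMv7 NEON has no vector divide, so
        // real code would use vrecpe plus refinement steps here.
        const float a = hi > 0.0f ? lo / hi : 0.0f;      // guards 0/0
        float r = atan_unit(a);
        r = ay > ax ? kHalfPi - r : r;                   // undo the min/max swap
        r = x[i] < 0.0f ? kPi - r : r;                   // left half-plane
        r = y[i] < 0.0f ? -r : r;                        // lower half-plane
        out[i] = r;
    }
    return out;
}

// Convenience wrapper for arrays whose length is a multiple of 4.
inline void atan2_approx_array(const float* y, const float* x, float* out,
                               std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        const Vec4 yv = {y[i], y[i + 1], y[i + 2], y[i + 3]};
        const Vec4 xv = {x[i], x[i + 1], x[i + 2], x[i + 3]};
        const Vec4 r = atan2_approx4(yv, xv);
        for (std::size_t k = 0; k < 4; ++k) out[i + k] = r[k];
    }
}
```

The range reduction (min/max swap and the two sign fix-ups) is what lets a single polynomial on [0, 1] cover every quadrant. Swapping atan_unit for a higher-degree polynomial improves accuracy without changing the structure.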

Abramowitz and Stegun's Handbook of Mathematical Functions contains a chapter on circular functions, with section 4.4.28 giving logarithmic formulas. This is similar to the eglibc implementation.


Source: https://habr.com/ru/post/1496581/

