The NVIDIA GT 555M GPU is a device with compute capability 2.1, so there is native hardware support for basic double-precision operations, including fused multiply-add (FMA). As with all NVIDIA GPUs, the square root operation is emulated in software. I am familiar with CUDA, but not with GLSL. According to version 4.3 of the GLSL specification, it exposes double-precision FMA as the function fma() and provides a double-precision square root, sqrt(). It is not clear whether the sqrt() implementation is correctly rounded according to IEEE-754 rules. I will assume that it is, by analogy with CUDA.
Instead of a Taylor series, one would want to use a polynomial minimax approximation, typically generated with a variant of the Remez algorithm, which reduces the number of terms required. To optimize for both speed and accuracy, the use of FMA is essential. Evaluating the polynomial with Horner's scheme is conducive to high accuracy. The code below uses a second-order Horner scheme. As in DanceIgel's answer, acos is conveniently computed using an asin approximation as the basic building block, in combination with standard mathematical identities.
With 400M test vectors, the maximum relative error observed with the code below was 2.67e-16, and the maximum ulp error observed was 1.442 ulp.
    /* Polynomial core of the asin approximation: an odd polynomial in a,
       evaluated with a second-order Horner scheme */
    double asin_core (double a)
    {
        double q, r, s, t;

        s = a * a;
        q = s * s;
        /* two independent FMA chains in q = a^4 */
        r =  5.5579749017470502e-2;
        t = -6.2027913464120114e-2;
        r = fma (r, q, 5.4224464349245036e-2);
        t = fma (t, q, -1.1326992890324464e-2);
        r = fma (r, q, 1.5268872539397656e-2);
        t = fma (t, q, 1.0493798473372081e-2);
        r = fma (r, q, 1.4106045900607047e-2);
        t = fma (t, q, 1.7339776384962050e-2);
        r = fma (r, q, 2.2372961589651054e-2);
        t = fma (t, q, 3.0381912707941005e-2);
        r = fma (r, q, 4.4642857881094775e-2);
        t = fma (t, q, 7.4999999991367292e-2);
        /* merge the two chains, then finish with plain Horner steps in s */
        r = fma (r, s, t);
        r = fma (r, s, 1.6666666666670193e-1);
        t = a * s;
        r = fma (r, t, a);  /* a + a^3 * p(a^2) */
        return r;
    }

    double my_acos (double a)
    {
        double r;

        r = (a > 0.0) ? -a : a;  /* map a onto the non-positive half-line */