This question is addressed to users experienced with the SSE / AVX instruction families and with analyzing their performance. I have seen many different implementations and approaches, from older SSE2 ones to newer ones, and the web is flooded with links to them. Personally, though, I do not really understand how to analyze SSE assembly: some people point to uops and caches, and that requires low-level knowledge. So I am asking for hints and your personal experience. If you have time to make a comparison, I would like to know which approach is the fastest, why, and what you looked at to decide. The implementation does not need to be very accurate: 10-16 bits of single-precision FP are enough. More is better, but only as long as it does not affect speed.
P.S. To avoid a meta-discussion, here is the task described exactly:
- The scalar argument x (in radians) is passed in an xmm register (according to the x64 fastcall convention).
- Write a function with the signature
  __m128 sincos(float x);
  which returns the approximate values of sin(x) and cos(x).
- The return value must be in a single xmm register and calculated in the fastest way that satisfies the 10-bit precision requirement.
- The argument can be any real number (but not NaN, inf, etc.). If the approach requires normalizing the argument, that step (an fmod() equivalent) is also part of the question. The question is not about handling FP special cases, though.
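To make the task concrete, here is a rough sketch of the kind of function I have in mind. It is only an illustration, not a claim about what is fastest. The name sincos_approx is made up (chosen so it does not clash with the libm sincos), and the approach is my assumption: put x and x + pi/2 in separate lanes so that one odd polynomial yields both sin(x) and cos(x) = sin(x + pi/2), reduce each lane to [-pi/2, pi/2] by subtracting a multiple of pi, and evaluate a degree-7 Taylor polynomial. It assumes SSE2, the default round-to-nearest MXCSR mode, and moderate |x|: the single-constant reduction loses accuracy as |x| grows, and the quotient must fit in a 32-bit integer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical sketch: returns { sin(x), cos(x), sin(x), cos(x) }.
 * Assumes round-to-nearest mode and moderate |x| (illustrative only). */
static inline __m128 sincos_approx(float x)
{
    const __m128 inv_pi = _mm_set1_ps(0.318309886f);  /* 1/pi */
    const __m128 pi     = _mm_set1_ps(3.14159265f);
    const __m128 phase  = _mm_setr_ps(0.0f, 1.57079633f, 0.0f, 1.57079633f);

    /* Lanes 0/2 hold x, lanes 1/3 hold x + pi/2 (cos via phase shift). */
    __m128 v = _mm_add_ps(_mm_set1_ps(x), phase);

    /* k = round(v / pi), r = v - k*pi, so r lies in about [-pi/2, pi/2]. */
    __m128i k = _mm_cvtps_epi32(_mm_mul_ps(v, inv_pi));
    __m128  r = _mm_sub_ps(v, _mm_mul_ps(_mm_cvtepi32_ps(k), pi));

    /* sin(v) = (-1)^k * sin(r). The polynomial is odd, so flipping the
     * sign bit of r for odd k flips the sign of the result. */
    __m128 sign = _mm_castsi128_ps(_mm_slli_epi32(k, 31));
    r = _mm_xor_ps(r, sign);

    /* sin(r) ~= r * (1 - r^2/6 + r^4/120 - r^6/5040), Horner form in r^2. */
    __m128 r2 = _mm_mul_ps(r, r);
    __m128 p  = _mm_set1_ps(-1.0f / 5040.0f);
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(1.0f / 120.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(-1.0f / 6.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r2), _mm_set1_ps(1.0f));
    return _mm_mul_ps(p, r);
}
```

By my estimate the truncation error of this polynomial is roughly 2e-4 at the ends of the reduced interval, which is a little better than 10 bits. I used plain Taylor coefficients only because they are easy to check; a minimax fit would presumably reach the same accuracy with fewer terms, and that kind of trade-off is exactly what I am asking about.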
This may be a duplicate, but I could not find a similar question here, so please point me to it if one already exists.