I recommend computing sin/cos with a Taylor series and storing the results with _mm256_stream_pd(). Here is a basic code example.
__m256d sin_req[10];
__m256d cos_req[10];
__m256d one_pd = _mm256_set1_pd(1.0);

// Precompute the Taylor coefficients. Factorial() is a helper returning n! as a double.
for (int i = 0; i < 10; ++i)
{
    sin_req[i] = i % 2 == 0 ? _mm256_set1_pd(-1.0 / Factorial((i + 1) * 2 + 1))
                            : _mm256_set1_pd(+1.0 / Factorial((i + 1) * 2 + 1));
    cos_req[i] = i % 2 == 0 ? _mm256_set1_pd(-1.0 / Factorial((i + 1) * 2 + 0))
                            : _mm256_set1_pd(+1.0 / Factorial((i + 1) * 2 + 0));
}

for (int i = 0; i < count; i += 4)
{
    __m256d voltVec = _mm256_load_pd(volt + i);
    __m256d angVec  = _mm256_load_pd(theta + i);

    // sin/cos by Taylor series.
    // Note: * and + on __m256d are a GCC/Clang vector extension; with MSVC use
    // _mm256_mul_pd / _mm256_add_pd instead.
    __m256d angleSq = angVec * angVec;
    __m256d sinVec  = angVec;
    __m256d cosVec  = one_pd;
    __m256d sin_serise = sinVec;
    __m256d cos_serise = one_pd;

    for (int j = 0; j < 10; ++j)
    {
        sin_serise = sin_serise * angleSq; // [1]
        cos_serise = cos_serise * angleSq;
        sinVec = sinVec + sin_serise * sin_req[j];
        cosVec = cosVec + cos_serise * cos_req[j];
    }

    __m256d resultReal = voltVec * sinVec;
    __m256d resultImag = voltVec * cosVec;
    _mm256_store_pd(xReal + i, resultReal);
    _mm256_store_pd(xImag + i, resultImag);
}
I get 57~58 CPU cycles to calculate 4 components.
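For reference, a minimal sketch of how such a cycles-per-4-components figure can be measured (assuming __rdtsc(); the wrapper function and its name are hypothetical, not my original harness):

#include <x86intrin.h>   // __rdtsc() with GCC/Clang; use <intrin.h> with MSVC

// Runs a kernel over `count` elements and returns the average cycles per 4 components.
static double cycles_per_four_components(void (*kernel)(int count), int count)
{
    unsigned long long t0 = __rdtsc();
    kernel(count);                               // e.g. the AVX loop above, wrapped in a function
    unsigned long long t1 = __rdtsc();
    return (double)(t1 - t0) * 4.0 / (double)count;
}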
I searched Google and ran some tests on the accuracy of my sin/cos. Some articles say that 10 iterations give double precision for -M_PI/2 < angle < +M_PI/2, and my test results show it is more accurate than the math.h sin/cos for -M_PI < angle < +M_PI. You can increase the iteration count for better accuracy at larger angles, if necessary.
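For what it's worth, here is a minimal scalar sketch of that kind of accuracy check, assuming a Factorial() helper like the one the AVX code uses (an illustration, not my exact test program):

#include <math.h>
#include <stdio.h>

// Helper assumed by the AVX code above: n! as a double.
static double Factorial(int n)
{
    double f = 1.0;
    for (int k = 2; k <= n; ++k)
        f *= (double)k;
    return f;
}

int main(void)
{
    double maxErr = 0.0;
    for (double a = -M_PI; a <= M_PI; a += 1e-4)
    {
        // Same 10-term scheme as the vector code: sin(a) ~ a + sum over j of sin_req[j] * a^(2j+3).
        double sq = a * a, term = a, s = a;
        for (int j = 0; j < 10; ++j)
        {
            term *= sq;
            double coef = (j % 2 == 0) ? -1.0 / Factorial((j + 1) * 2 + 1)
                                       : +1.0 / Factorial((j + 1) * 2 + 1);
            s += term * coef;
        }
        double err = fabs(s - sin(a));
        if (err > maxErr)
            maxErr = err;
    }
    printf("max |taylor_sin - libm_sin| over [-pi, pi]: %g\n", maxErr);
    return 0;
}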
However, let's go deeper and optimize this code further. It has a latency problem: the AVX multiply latency is 5 processor cycles, which means one iteration of the inner loop cannot run faster than 5 cycles, because the multiply at [1] uses the result of the previous iteration. Ten iterations therefore give a lower bound of about 50 cycles from the dependency chain alone, which accounts for most of the 57~58 cycles measured above.
We can simply unroll it like this.
for (int i = 0; i < count; i += 8)
{
    __m256d voltVec0 = _mm256_load_pd(volt + i + 0);
    __m256d voltVec1 = _mm256_load_pd(volt + i + 4);
    __m256d angVec0  = _mm256_load_pd(theta + i + 0);
    __m256d angVec1  = _mm256_load_pd(theta + i + 4);

    __m256d sinVec0;
    __m256d sinVec1;
    __m256d cosVec0;
    __m256d cosVec1;

    // Two independent Taylor evaluations per iteration hide the multiply latency.
    __m256d angleSq0 = angVec0 * angVec0;
    __m256d angleSq1 = angVec1 * angVec1;

    sinVec0 = angVec0;
    sinVec1 = angVec1;
    cosVec0 = one_pd;
    cosVec1 = one_pd;

    __m256d sin_serise0 = sinVec0;
    __m256d sin_serise1 = sinVec1;
    __m256d cos_serise0 = one_pd;
    __m256d cos_serise1 = one_pd;

    for (int j = 0; j < 10; ++j)
    {
        sin_serise0 = sin_serise0 * angleSq0;
        cos_serise0 = cos_serise0 * angleSq0;
        sin_serise1 = sin_serise1 * angleSq1;
        cos_serise1 = cos_serise1 * angleSq1;
        sinVec0 = sinVec0 + sin_serise0 * sin_req[j];
        cosVec0 = cosVec0 + cos_serise0 * cos_req[j];
        sinVec1 = sinVec1 + sin_serise1 * sin_req[j];
        cosVec1 = cosVec1 + cos_serise1 * cos_req[j];
    }

    __m256d realResult0 = voltVec0 * sinVec0;
    __m256d imagResult0 = voltVec0 * cosVec0;
    __m256d realResult1 = voltVec1 * sinVec1;
    __m256d imagResult1 = voltVec1 * cosVec1;

    _mm256_store_pd(xReal + i + 0, realResult0);
    _mm256_store_pd(xImag + i + 0, imagResult0);
    _mm256_store_pd(xReal + i + 4, realResult1);
    _mm256_store_pd(xImag + i + 4, imagResult1);
}
This runs in 51~51.5 cycles to calculate 4 components (102~103 cycles for 8 components).
This breaks the multiply dependency chain in the Taylor loop and keeps the AVX multiply unit about 85% busy. Unrolling solves many latency problems as long as the compiler does not spill registers to memory, so generate the assembly at compile time (-S with GCC/Clang, /FA with MSVC) and check how your compiler handles your code. I tried to unroll further, but that made things worse because the working set no longer fits in the 16 AVX registers: with two streams you already keep roughly a dozen YMM values live (angleSq, sinVec, cosVec, sin_serise, cos_serise per stream, plus the coefficient loads and one_pd), so a third stream starts spilling.
Now for the memory optimization: replace _mm256_store_pd() with _mm256_stream_pd().
_mm256_stream_pd(xReal + i + 0, realResult0);
_mm256_stream_pd(xImag + i + 0, imagResult0);
_mm256_stream_pd(xReal + i + 4, realResult1);
_mm256_stream_pd(xImag + i + 4, imagResult1);
Replacing the result stores with streaming stores brings it down to 48 cycles to calculate 4 components.
_mm256_stream_pd() is always faster if you are not going to read the data back soon. It bypasses the cache hierarchy and sends the data straight to the memory controller, so it does not pollute your cache, which leaves more cache space for the data you actually read.
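One caveat: _mm256_stream_pd(), like the aligned load/store intrinsics above, requires 32-byte-aligned pointers, and non-temporal stores should be fenced before other code reads the results. Here is a minimal sketch of the buffer setup I am assuming (my actual allocation code is not shown above, so treat this as an illustration):

#include <immintrin.h>

int count = 18562320;   // problem size, as in the final measurement below

// 32-byte alignment is required by _mm256_load_pd / _mm256_store_pd / _mm256_stream_pd.
double *volt  = (double *)_mm_malloc((size_t)count * sizeof(double), 32);
double *theta = (double *)_mm_malloc((size_t)count * sizeof(double), 32);
double *xReal = (double *)_mm_malloc((size_t)count * sizeof(double), 32);
double *xImag = (double *)_mm_malloc((size_t)count * sizeof(double), 32);

// ... fill volt/theta and run the streaming-store loop above ...

_mm_sfence();   // make the non-temporal stores visible before anyone reads xReal/xImag

_mm_free(volt);
_mm_free(theta);
_mm_free(xReal);
_mm_free(xImag);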
Try prefetching.
for (int i = 0; i < count; i += 8)
{
    _mm_prefetch((const char *)(volt  + i + 5 * 8), _MM_HINT_T0);
    _mm_prefetch((const char *)(theta + i + 5 * 8), _MM_HINT_T0);
    // ... the rest of the unrolled loop body stays the same as above ...
}
Now I get 45.6~45.8 CPU cycles to calculate 4 components, and the AVX multiply unit is busy 94% of the time.
Prefetching hints the cache to fetch data ahead of time for faster reads. I recommend prefetching about 400~500 CPU cycles ahead, based on the RAS-CAS latency of physical memory. Physical memory latency can be up to 300 cycles in the worst case; it varies with the hardware configuration, but it will not be below 200 cycles even with expensive low-RAS-CAS-latency memory.
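To relate the +5*8 offset to that budget: the unrolled loop handles 8 doubles in roughly 90 cycles, so prefetching 5 iterations ahead requests the data about 450 cycles before it is used, which lands in the 400~500-cycle window. If you want to tune that on different hardware, one option is to pull the distance out into a constant (the constant name here is mine, not from my original code):

// Hypothetical tuning knob: iterations of lead time for the prefetch.
// 5 iterations * ~90 cycles/iteration ~= 450 cycles, matching the recommendation above.
#define PREFETCH_ITER_AHEAD 5

for (int i = 0; i < count; i += 8)
{
    _mm_prefetch((const char *)(volt  + i + PREFETCH_ITER_AHEAD * 8), _MM_HINT_T0);
    _mm_prefetch((const char *)(theta + i + PREFETCH_ITER_AHEAD * 8), _MM_HINT_T0);
    // ... unrolled Taylor evaluation and streaming stores as above ...
}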
0.064 sec (count = 18562320)
End of the sin/cos optimization. :-)