How to improve the performance of the following loop

I have a simple loop in C where I convert magnitude and angle arrays to real and imaginary parts. I have two versions of the loop. Version 1 is a simple loop that performs the conversion with the following code:

    for (k = 0; k < n; k++) {
        xReal[k] = Mag[k] * cos(Angle[k]);
        xImag[k] = Mag[k] * sin(Angle[k]);
    }

And Version 2, where intrinsics are used to vectorize the loop:

    __m256d cosVec, sinVec;
    __m256d resultReal, resultImag;
    __m256d angVec, voltVec;

    for (k = 0; k < SysData->totNumOfBus; k += 4) {
        voltVec = _mm256_loadu_pd(volt + k);
        angVec  = _mm256_loadu_pd(theta + k);

        sinVec = _mm256_sincos_pd(&cosVec, angVec); /* SVML intrinsic (icc) */

        resultImag = _mm256_mul_pd(voltVec, sinVec);
        resultReal = _mm256_mul_pd(voltVec, cosVec);

        _mm256_store_pd(xReal + k, resultReal);
        _mm256_store_pd(xImag + k, resultImag);
    }

On a Core i7 2600K @ 3.4 GHz, these loops give the following results:

    Version 1: n = 18562320, Time: 0.2 sec
    Version 2: n = 18562320, Time: 0.16 sec

Simple calculations with these values show that each Version 1 iteration takes almost 36 cycles to complete, while a Version 2 iteration requires 117 cycles. Given that calculating sine and cosine is inherently expensive, these numbers do not seem bad. However, this loop is a serious bottleneck in my function, as profiling shows that almost 1/3 of the time is spent in it. So, I am wondering whether there is a way to speed up this loop (for example, by computing the sine and cosine functions differently). Could you please help me with this and let me know whether there is room for improving the performance of this loop?

Thank you in advance for your help.

PS: I use icc to compile the code. I should also mention that the data is not aligned (and cannot be). However, data alignment gives only a slight performance improvement (less than 1 percent).

+4

3 answers

I recommend using a Taylor-series sin/cos and the _mm256_stream_pd() function for storing the data. Here is a basic code example:

    __m256d sin_req[10];
    __m256d cos_req[10];
    __m256d one_pd = _mm256_set1_pd(1.0);

    /* Precompute the alternating 1/n! Taylor coefficients once.
       Factorial(n) is a helper returning n! as a double. */
    for (int i = 0; i < 10; ++i) {
        sin_req[i] = i % 2 == 0 ? _mm256_set1_pd(-1.0 / Factorial((i + 1) * 2 + 1))
                                : _mm256_set1_pd(+1.0 / Factorial((i + 1) * 2 + 1));
        cos_req[i] = i % 2 == 0 ? _mm256_set1_pd(-1.0 / Factorial((i + 1) * 2 + 0))
                                : _mm256_set1_pd(+1.0 / Factorial((i + 1) * 2 + 0));
    }

    /* Note: arithmetic operators on __m256d are a gcc/icc extension. */
    for (int i = 0; i < count; i += 4) {
        __m256d voltVec = _mm256_load_pd(volt + i);
        __m256d angVec  = _mm256_load_pd(theta + i);

        /* sin/cos by Taylor series */
        __m256d angleSq = angVec * angVec;
        __m256d sinVec = angVec;
        __m256d cosVec = one_pd;
        __m256d sin_series = sinVec;
        __m256d cos_series = one_pd;
        for (int j = 0; j < 10; ++j) {
            sin_series = sin_series * angleSq; // [1]
            cos_series = cos_series * angleSq;
            sinVec = sinVec + sin_series * sin_req[j];
            cosVec = cosVec + cos_series * cos_req[j];
        }

        __m256d resultReal = voltVec * cosVec; /* real = mag * cos */
        __m256d resultImag = voltVec * sinVec; /* imag = mag * sin */
        _mm256_store_pd(xReal + i, resultReal);
        _mm256_store_pd(xImag + i, resultImag);
    }

I can get 57 ~ 58 CPU cycles to calculate 4 components.

I searched Google and ran some accuracy tests on my sin/cos. Some articles say that 10 iterations give double precision for -M_PI/2 < angle < +M_PI/2, and my test results show it is more accurate than the math.h sin/cos for -M_PI < angle < +M_PI. You can increase the iteration count for greater accuracy at larger angles, if necessary.

However, we can optimize this code further. It has a dependency chain in the Taylor calculation loop: the AVX multiply latency is 5 CPU cycles, which means we cannot run one iteration faster than 5 cycles, because line [1] uses the result of the previous iteration.

We can simply unroll it like this:

    for (int i = 0; i < count; i += 8) {
        __m256d voltVec0 = _mm256_load_pd(volt + i + 0);
        __m256d voltVec1 = _mm256_load_pd(volt + i + 4);
        __m256d angVec0  = _mm256_load_pd(theta + i + 0);
        __m256d angVec1  = _mm256_load_pd(theta + i + 4);

        __m256d angleSq0 = angVec0 * angVec0;
        __m256d angleSq1 = angVec1 * angVec1;

        __m256d sinVec0 = angVec0;
        __m256d sinVec1 = angVec1;
        __m256d cosVec0 = one_pd;
        __m256d cosVec1 = one_pd;

        __m256d sin_series0 = sinVec0;
        __m256d sin_series1 = sinVec1;
        __m256d cos_series0 = one_pd;
        __m256d cos_series1 = one_pd;

        for (int j = 0; j < 10; ++j) {
            sin_series0 = sin_series0 * angleSq0;
            cos_series0 = cos_series0 * angleSq0;
            sin_series1 = sin_series1 * angleSq1;
            cos_series1 = cos_series1 * angleSq1;
            sinVec0 = sinVec0 + sin_series0 * sin_req[j];
            cosVec0 = cosVec0 + cos_series0 * cos_req[j];
            sinVec1 = sinVec1 + sin_series1 * sin_req[j];
            cosVec1 = cosVec1 + cos_series1 * cos_req[j];
        }

        __m256d realResult0 = voltVec0 * cosVec0;
        __m256d imagResult0 = voltVec0 * sinVec0;
        __m256d realResult1 = voltVec1 * cosVec1;
        __m256d imagResult1 = voltVec1 * sinVec1;

        _mm256_store_pd(xReal + i + 0, realResult0);
        _mm256_store_pd(xImag + i + 0, imagResult0);
        _mm256_store_pd(xReal + i + 4, realResult1);
        _mm256_store_pd(xImag + i + 4, imagResult1);
    }

This runs at 51 ~ 51.5 cycles per 4 components (102 ~ 103 cycles for 8 components).

It breaks the multiply dependency in the Taylor calculation loop and keeps the AVX multiply unit 85% busy. Unrolling solves many latency problems, as long as it does not spill registers to memory. Generate the asm file while compiling and check how your compiler handles your code. I tried unrolling further, but it performed worse because the code could no longer fit in the 16 AVX registers.

Now for the memory optimization: replace _mm256_store_pd() with _mm256_stream_pd().

    _mm256_stream_pd(xReal + i + 0, realResult0);
    _mm256_stream_pd(xImag + i + 0, imagResult0);
    _mm256_stream_pd(xReal + i + 4, realResult1);
    _mm256_stream_pd(xImag + i + 4, imagResult1);

Replacing the result writes brings it down to 48 cycles per 4 components.

_mm256_stream_pd() is always faster if you are not going to read the data back soon: it bypasses the cache hierarchy and sends the data directly to the memory controller, without polluting your cache. As a result you get more cache space for reading data.

Try prefetching.

    for (int i = 0; i < count; i += 8) {
        _mm_prefetch((const char *)(volt + i + 5 * 8), _MM_HINT_T0);
        _mm_prefetch((const char *)(theta + i + 5 * 8), _MM_HINT_T0);
        // calculations here
    }

Now I get 45.6 ~ 45.8 CPU cycles per 4-component calculation, with the AVX multiply unit 94% occupied.

Prefetching hints the cache so that later reads are faster. I recommend prefetching 400 ~ 500 CPU cycles ahead, based on the RAS-CAS latency of physical memory. Physical memory latency can be up to 300 cycles in the worst case; it varies with the hardware configuration, but will not drop below 200 cycles even with expensive low-RAS-CAS-latency memory.

Final result: 0.064 sec (count = 18562320)

End of the sin/cos optimization. :-)

+5

Things to check:

  • Does the starting address of each array meet the alignment requirement (16 bytes here; aligned 256-bit AVX accesses need 32)? The i7 supports unaligned AVX access without a bus-error complaint, just at higher latency.

  • Check the cache hit/miss rate with a profiling tool. Memory access appears to be the bottleneck of the Version 2 loop.

  • You can lower the precision, or use a lookup table, to compute sin and cos.

  • Consider how much you can actually gain. Since the loop takes only 1/3 of the total time, even optimizing it down to zero would reduce the overall time by only about a third (a 1.5x speedup at best).

+1

The timing results show that Version 2 is faster (by 20%) than Version 1:

    Version 1: n = 18562320, Time: 0.2 sec
    Version 2: n = 18562320, Time: 0.16 sec

I am not sure how you calculated the cycles used in each version. There is more than one unit at work inside the processor, and cache fetching can account for the timing difference even though v1 uses fewer cycles (again, without knowing how you counted them).

Another way to put it: with vectorization, data elements are available without any wait time for memory fetches.

0

Source: https://habr.com/ru/post/1496688/

