Is there a better way to optimize the potential function of lennard jones?

Actually, this is the derivative of the Lennard-Jones potential. The reason I ask is that I am writing a Molecular Dynamics program, and at least 80% of its runtime is spent in the following function, even with the most aggressive compiler options (gcc -O3).

double ljd(double r)
/* Derivative of the Lennard-Jones potential for Argon with respect to distance (r) */
{
    double temp;
    temp = Si/r;
    temp = temp*temp;
    temp = temp*temp*temp;
    return ( (24*Ep/r)*(temp-(2 * pow(temp,2))) );
}

This code lives in the "functs.h" file, which I include from my main file. I thought using a temporary variable this way would make the function faster, but I worry that creating it is wasteful. Should I declare it static? Also, the code is parallelized with OpenMP, so I cannot make temp a global variable.

The Ep and Si constants are defined with #define. I have only been using C for about a month. I tried looking at the assembly generated by gcc, but I was completely lost.

+4
6 answers

I would get rid of the call to pow():

double ljd(double r)
/* Derivative of the Lennard-Jones potential for Argon with respect to distance (r) */
{
    double temp;
    temp = Si / r;
    temp = temp * temp;
    temp = temp * temp * temp;
    return ( (24.0 * Ep / r) * (temp - (2.0 * temp * temp)) );
}
+7

On my architecture (Intel Centrino Duo, MinGW-gcc 4.5.2 on Windows XP), this un-optimized code using pow()

static inline double ljd(double r)
{
    return 24 * Ep / Si * (pow(Si / r, 7) - 2 * pow(Si / r, 13));
}

actually outperforms your version when -ffast-math is enabled.

The generated assembly (using some arbitrary values for Ep and Si) is as follows:

fldl    LC0
fdivl   8(%ebp)
fld     %st(0)
fmul    %st(1), %st
fmul    %st, %st(1)
fld     %st(0)
fmul    %st(1), %st
fmul    %st(2), %st
fxch    %st(1)
fmul    %st(2), %st
fmul    %st(0), %st
fmulp   %st, %st(2)
fxch    %st(1)
fadd    %st(0), %st
fsubrp  %st, %st(1)
fmull   LC1
+3

Well, as I said, compilers are bad at optimizing floating-point code, for many reasons. So here is an Intel assembly version that should be faster (compiled using DevStudio 2005):

const double Si6 = /*whatever pow(Si,6) is*/;
const double Si_value = /*whatever Si is*/; /* need _value as Si is a register name! */
const double Ep24 = /*whatever 24*Ep is*/;

double ljd (double r)
{
    double result;
    __asm
    {
        fld qword ptr [r]
        fld st(0)
        fmul st(0),st(0)
        fld st(0)
        fmul st(0),st(0)
        fmulp st(1),st(0)
        fld qword ptr [Si6]
        fdivrp st(1),st(0)
        fld st(0)
        fld1
        fsub st(0),st(1)
        fsubrp st(1),st(0)
        fmulp st(1),st(0)
        fld qword ptr [Ep24]
        fmulp st(1),st(0)
        fdivrp st(1),st(0)
        fstp qword ptr [result]
    }
    return result;
}

This version will give slightly different results from the posted version. With the C source, the compiler is likely to write intermediate results to RAM, which loses precision because the (Intel) FPU works with 80 bits internally while the double type is only 64 bits. The assembly above keeps all intermediate results at 80 bits; only the final result is rounded to 64 bits.

+1

Ah, that brings back some memories... I worked with Lennard-Jones for many years.

In my scenario (not huge systems) it was enough to replace pow() with a few multiplications, as suggested in another answer. I also limited the neighbor range, effectively cutting off the potential at about r ~ 3.5 and then applying the standard thermodynamic correction.

But if this is not enough for you, my suggestion would be to first tabulate the function at closely spaced r values and then simply interpolate (for example, linearly or quadratically).

+1

Is your application structured so that you could profitably vectorize this function, computing several independent values in parallel? That would let you use hardware vector units such as SSE.

It also seems like you may be better off keeping 1/r values around rather than r.


Here is an example that explicitly uses SSE2 intrinsics to implement the function. ljd() computes two values at once.

static __m128d ljd(__m128d r)
{
    static const __m128d two = { 2.0, 2.0 };
    static const __m128d si = { Si, Si };
    static const __m128d ep24 = { 24 * Ep, 24 * Ep };

    __m128d temp2, temp3;
    __m128d temp = _mm_div_pd(si, r);
    __m128d ep24_r = _mm_div_pd(ep24, r);

    temp = _mm_mul_pd(temp, temp);
    temp2 = _mm_mul_pd(temp, temp);
    temp2 = _mm_mul_pd(temp2, temp);
    temp3 = _mm_mul_pd(temp2, temp2);
    temp3 = _mm_mul_pd(temp3, two);

    return _mm_mul_pd(ep24_r, _mm_sub_pd(temp2, temp3));
}

/* Requires `out` and `in` to be 16-byte aligned */
void ljd_array(double out[], const double in[], int n)
{
    int i;
    for (i = 0; i < n; i += 2)
    {
        _mm_store_pd(out + i, ljd(_mm_load_pd(in + i)));
    }
}

However, it is important to note that recent versions of GCC can often vectorize functions like this automatically, provided you target the right architecture and turn on optimization. If you are targeting 32-bit x86, try compiling with -msse2 -O3, and make sure the input and output arrays are 16-byte aligned.

Alignment of static and automatic arrays can be achieved in gcc with the type attribute __attribute__ ((aligned (16))), and for dynamically allocated arrays with the posix_memalign() function.

+1

The local variable is just fine. It costs nothing. Leave it alone.

As others have said, get rid of the pow call. It cannot be faster than just squaring a number, and it can be much slower.

However, just because the function is active 80% of the time does not mean the function itself is the problem. It only means there may be something to optimize: either there, or in something it calls (such as pow), or in whatever calls it.

If you try random pausing, which is a stack-sampling method, you will see this routine in 80+% of the samples, and you will see which lines inside it are responsible for the time, plus which of its callers are responsible, and their callers, and so on. All the lines of code on the stack are jointly responsible for the time.

Optimality is not when nothing takes a large percentage of the time; it is when nothing that you can fix takes a large percentage of the time.

+1

Source: https://habr.com/ru/post/1338733/

