Peaks in calculation time when working on arrays of certain sizes

I created a program that allocates two arrays of C++ doubles. The first contains the values of sin(x) for x from 0 to π. Into the second I write the second derivative of sin(x), calculated using the three-point formula:

(vals[i-1]-2*vals[i]+vals[i+1])/(dx*dx) 
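For reference, a Taylor expansion of the neighbouring samples shows why this stencil approximates the second derivative (so for vals[i] = sin(x_i) the result should be close to -sin(x_i), which is what the error check in the code below compares against):

    f(x \pm h) = f(x) \pm h f'(x) + \frac{h^2}{2} f''(x) \pm \frac{h^3}{6} f'''(x) + O(h^4)

    \frac{f(x-h) - 2 f(x) + f(x+h)}{h^2} = f''(x) + O(h^2)

Here h is the grid spacing dx from the formula above.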

I perform these calculations for arrays of different sizes, repeating each size several times, and record the average calculation time. This gives me a nice graph of the time it takes to compute the derivatives as a function of array size.

Everything sounds pretty simple, but I ran into a strange problem. Look at the graph [plot of average calculation time versus array size]: when the arrays are small, nothing strange happens. However, once they exceed 10,000 elements (which, with two arrays of 8-byte doubles, means about 160 KB of memory), I get periodic peaks, each one 512 elements after the previous one. This happens on several processors (AMD Sempron 3850, Intel Q9400 and Intel Xeon 5310) and does not happen on others (Intel i5 5200U, Intel Nehalem-C).

As if that was not enough, when the arrays reach about 65,000 elements, the AMD 3850 on Windows 7 shows an additional sharp increase in time. This does not happen on Debian.

I think this may be due to CPU caching, given the "magic" sizes involved, and to OS scheduling, but I cannot come up with a specific explanation or a way to confirm it (other than profiling CPU cache misses).
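One experiment that might confirm or rule this out (just a sketch, I have not measured it this way) is to take the relative placement of the two arrays out of the allocator's hands and fix their offset explicitly, then check whether the peaks follow the offset rather than the array size:

    #include <cstddef>

    // Sketch: put the second array at a chosen byte offset inside an
    // over-allocated buffer, so the distance between vals and derivs can be
    // varied deliberately. offset_bytes is a parameter invented for this
    // experiment and should stay a multiple of sizeof(double).
    struct OffsetArray {
        char*   raw;    // owning pointer, release with delete[]
        double* data;   // view starting at the requested offset
    };

    OffsetArray make_offset_array(std::size_t n, std::size_t offset_bytes) {
        // one extra 4 KiB page of slack so any offset below 4096 still fits
        char* raw = new char[n * sizeof(double) + 4096];
        return { raw, reinterpret_cast<double*>(raw + offset_bytes) };
    }

If the peaks appear and disappear as offset_bytes changes (for example when the two arrays end up a multiple of 4 KiB apart), that would point at cache/page alignment rather than at the array size itself.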

The full test program:

    #include <cmath>
    #include <cstdio>

    double msclock();   // timing helper, described below the listing

    int main(int, char**) {
        double maxerr=0.0;
        double avgerr=0.0;
        for(unsigned SIZE=5000u; SIZE<100000u; SIZE+=1u) {
            const unsigned REPEATS=10000000/SIZE+1;
            const double STEP=1.0/(SIZE-1.0)*atan(1.0)*4.0;   // pi/(SIZE-1)
            double *vals;
            double *derivs;
            // Allocation
            vals=  new double[SIZE];
            derivs=new double[SIZE];
            // Initialization
            for(unsigned i=0u; i<SIZE; ++i) {
                vals[i]=sin(i*STEP);
                derivs[i]=0.0;
            }
            // Normal (FPU) computation
            const double TIME_FPU_START=msclock();
            for(unsigned r=0u; r<REPEATS; ++r) {
                const double STEP2RCP=1.0/(STEP*STEP);
                for(unsigned i=1u; i<SIZE-1u; ++i) {
                    derivs[i]=(vals[i-1]-2.0*vals[i]+vals[i+1u])*STEP2RCP;
                }
            }
            const double TIME_FPU_END=msclock();
            const double TIME_FPU=(TIME_FPU_END-TIME_FPU_START)/REPEATS;
            // Error check against the exact second derivative -sin(x)
            maxerr=0.0;
            avgerr=0.0;
            for(unsigned i=1u; i<SIZE-1; ++i) {
                double err=fabs((derivs[i]+sin(i*STEP))/(-sin(i*STEP)));
                avgerr+=err/(SIZE-2);
                if(err>maxerr) {
                    //printf("%f > %f\n", derivs[i], -sin(i*step));
                    maxerr=err;
                }
            }
            const double MAX_REL_ERR_FPU=maxerr;
            const double AVG_REL_ERR_FPU=avgerr;
            // Summary
            fprintf(stdout, "%u %.8f %.8f\n", SIZE, TIME_FPU, AVG_REL_ERR_FPU);
            fprintf(stderr, "%u %.8f %.8f\n", SIZE, TIME_FPU, AVG_REL_ERR_FPU);
            // Deallocation
            delete [] vals;
            delete [] derivs;
        }
        return 0;
    }

The msclock function is just plain old clock()/CLOCKS_PER_SEC on Linux, and QueryPerformanceCounter on Windows (clock() and all its variants on Windows seem to have a resolution of about 16 ms).
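A minimal sketch of such an msclock, assuming it returns elapsed time in seconds as a double (the exact implementation is not shown above, so treat this as an approximation):

    #include <ctime>
    #ifdef _WIN32
    #include <windows.h>
    #endif

    // Sketch of the timing helper: returns elapsed time in seconds.
    // On Windows it uses the high-resolution performance counter; elsewhere
    // it falls back to clock()/CLOCKS_PER_SEC as described above.
    double msclock() {
    #ifdef _WIN32
        LARGE_INTEGER freq, now;
        QueryPerformanceFrequency(&freq);   // counter ticks per second
        QueryPerformanceCounter(&now);      // current tick count
        return double(now.QuadPart) / double(freq.QuadPart);
    #else
        return double(clock()) / CLOCKS_PER_SEC;
    #endif
    }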

Data file: https://pastebin.com/4iDn5Y9k. The first column is the array size, the second the average calculation time, and the last the average relative error of the derivative.

Additional information. I checked the addresses that the allocations return. The first array (vals) always gets the same address, while the second (derivs) keeps moving upward. This probably means that the start of the second array lies in the same memory block as the end of the first one. With 8-byte doubles and a 4 KiB page size, growing by 512 elements is exactly what it takes to need another page.
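The extra address columns in the data below came from printing the two pointers next to each result line; a minimal sketch of that extra output (the exact format string is an assumption):

    // Sketch: append the addresses of both arrays (columns 4 and 5 below)
    fprintf(stderr, "%u %.8f %.8f %p %p\n",
            SIZE, TIME_FPU, AVG_REL_ERR_FPU,
            (void*)vals, (void*)derivs);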

Additional information. On Debian, the array addresses at the peaks clearly stand out (columns 4 and 5):

    50670 0.00021090 0.00000016 0x55efa5e04c30 0x55efa5e67bb0
    50680 0.00021244 0.00000017 0x55efa5e04c30 0x55efa5e67c00
    50690 0.00170968 0.00000023 0x7f1617d0b010 0x7f1617ca7010 @
    50700 0.00021141 0.00000024 0x55efa5e04c30 0x55efa5e67ca0 @
    50710 0.00021873 0.00000027 0x55efa5e04c30 0x55efa5e67cf0
    50720 0.00021126 0.00000002 0x55efa5e04c30 0x55efa5e67d40

Additional information. For arrays smaller than 10,000 elements, both arrays always get the same addresses. Above that size, the second array keeps moving around. This may be why I do not see peaks for small sizes.

I also played with "perf stat" as @geza suggested. Output of the first run, for sizes from 5000 to 12930:

     Performance counter stats for './a.out':

          70733.635707      task-clock (msec)         #    0.999 CPUs utilized
                 6,008      context-switches          #    0.085 K/sec
                     3      cpu-migrations            #    0.000 K/sec
                 3,866      page-faults               #    0.055 K/sec
        90,800,323,159      cycles                    #    1.284 GHz                    (50.00%)
                     0      stalled-cycles-frontend                                     (50.01%)
                     0      stalled-cycles-backend    #    0.00% backend cycles idle    (50.01%)
       160,806,960,842      instructions              #    1.77  insn per cycle         (50.00%)
        16,118,385,897      branches                  #  227.874 M/sec                  (50.00%)
             5,818,878      branch-misses             #    0.04% of all branches        (49.99%)

          70.839631335 seconds time elapsed

... And for sizes from 12940 to 30000:

     Performance counter stats for './a.out':

         157468.056544      task-clock (msec)         #    0.999 CPUs utilized
                13,420      context-switches          #    0.085 K/sec
                    10      cpu-migrations            #    0.000 K/sec
                32,271      page-faults               #    0.205 K/sec
       201,597,569,555      cycles                    #    1.280 GHz                    (50.00%)
                     0      stalled-cycles-frontend                                     (50.00%)
                     0      stalled-cycles-backend    #    0.00% backend cycles idle    (50.00%)
       351,405,781,351      instructions              #    1.74  insn per cycle         (50.00%)
        35,289,858,166      branches                  #  224.108 M/sec                  (50.00%)
            10,494,566      branch-misses             #    0.03% of all branches        (50.00%)

         157.692170504 seconds time elapsed

The program ran about twice as long over the second range, so twice as many context switches are not surprising. What stands out is that in this range I get far more page faults.

Another interesting thing: when I ran the program a second time, the second array did not move around in memory nearly as much. It seems to start doing that only after the program has been running for a while.

The longer I test things, the less I know ;_;



