I study how cache misses affect the speed of calculations. I know that there are many algorithms for multiplying two matrices (even a simple exchange of two of the loops below will help), but please consider this code:
float a[N][N]; float b[N][N]; float c[N][N]; // ... { for (int i = 0; i < N; i++) { for (int j = 0; j < N; j++) { float sum = 0.0; for (int k = 0; k < N; k++) { sum = sum + a[i][k] * b[k][j]; } c[i][j] = sum; } } }
I recompiled this code for many N values ββand measured its launch time. I expected a sudden increase in time by about N=1250 , after which the matrix c no longer fits the cache (size c is 1250*1250*sizeof(float)=6250000 or about 6 MB, which is the size of my L3 cache).
Indeed, the general trend is that after this point the average time has approximately tripled compared to the extrapolated time previously. But the value of N%8 , apparently, has a huge impact on the result. For instance:
1601 - 11.237548 1602 - 7.679103 1603 - 12.216982 1604 - 6.283644 1605 - 11.360517 1606 - 7.486021 1607 - 11.292025 1608 - 5.794537 1609 - 11.469469 1610 - 7.581660 1611 - 11.367203 1612 - 6.126014 1613 - 11.730543 1614 - 7.632121 1615 - 11.773091 1616 - 5.778463 1617 - 11.556687 1618 - 7.682941 1619 - 11.576068 1620 - 6.273122 1621 - 11.635411 1622 - 7.804220 1623 - 12.053517 1624 - 6.008985
For some time, I thought that it could be alignment problems - the rows of any matrix are aligned to 32 bytes when N%8==0 (the first question is why are 32 bytes in particular? SSE commands like movaps can work on 16B aligned data).
Another idea was that this might have something to do with the cache association (8-way for L1 and L2 and 12-way for L3 on my machine).
But then I noticed that for some values ββof N , such as 1536 , unexpected bursts appear (although alignment should be excellent in these cases - 1536==256*6 , associativity is also not a problem) 1536==128*12==192*8 ). For instance:
1504 - 4.644781 1512 - 4.794254 1520 - 4.768555 1528 - 4.884714 1536 - 7.949040 1544 - 5.162613 1552 - 5.083331 1560 - 5.388706
Time is pretty consistent, so CPU spikes are not a problem. I am compiling code with optimizations enabled ( -O2 ). My ideas, unfortunately, are ending. What could be causing this behavior?