Matrix Multiplication Issues

I study how cache misses affect the speed of calculations. I know that there are many algorithms for multiplying two matrices (even a simple exchange of two of the loops below will help), but please consider this code:

float a[N][N]; float b[N][N]; float c[N][N]; // ... { for (int i = 0; i < N; i++) { for (int j = 0; j < N; j++) { float sum = 0.0; for (int k = 0; k < N; k++) { sum = sum + a[i][k] * b[k][j]; } c[i][j] = sum; } } } 

I recompiled this code for many N values ​​and measured its launch time. I expected a sudden increase in time by about N=1250 , after which the matrix c no longer fits the cache (size c is 1250*1250*sizeof(float)=6250000 or about 6 MB, which is the size of my L3 cache).

Indeed, the general trend is that after this point the average time has approximately tripled compared to the extrapolated time previously. But the value of N%8 , apparently, has a huge impact on the result. For instance:

 1601 - 11.237548 1602 - 7.679103 1603 - 12.216982 1604 - 6.283644 1605 - 11.360517 1606 - 7.486021 1607 - 11.292025 1608 - 5.794537 1609 - 11.469469 1610 - 7.581660 1611 - 11.367203 1612 - 6.126014 1613 - 11.730543 1614 - 7.632121 1615 - 11.773091 1616 - 5.778463 1617 - 11.556687 1618 - 7.682941 1619 - 11.576068 1620 - 6.273122 1621 - 11.635411 1622 - 7.804220 1623 - 12.053517 1624 - 6.008985 

For some time, I thought that it could be alignment problems - the rows of any matrix are aligned to 32 bytes when N%8==0 (the first question is why are 32 bytes in particular? SSE commands like movaps can work on 16B aligned data).

Another idea was that this might have something to do with the cache association (8-way for L1 and L2 and 12-way for L3 on my machine).

But then I noticed that for some values ​​of N , such as 1536 , unexpected bursts appear (although alignment should be excellent in these cases - 1536==256*6 , associativity is also not a problem) 1536==128*12==192*8 ). For instance:

 1504 - 4.644781 1512 - 4.794254 1520 - 4.768555 1528 - 4.884714 1536 - 7.949040 1544 - 5.162613 1552 - 5.083331 1560 - 5.388706 

Time is pretty consistent, so CPU spikes are not a problem. I am compiling code with optimizations enabled ( -O2 ). My ideas, unfortunately, are ending. What could be causing this behavior?

+5
source share
1 answer

The most important for your example is the CPU cache line size. The CPU is usually 64 bytes. Even if your program reads or writes 1 byte, the CPU reads / writes for all lines (64 bytes). That is why if your program is in cache lines, your performance is good. If it misses, there is additional overhead for reading / writing memory. The size of the L3 cache is not so critical.

for your code

 // all your stack variables are good. Compiler will optimize them well. for (int i = 0; i < N; i++) { for (int j = 0; j < N; j++) { float sum = 0.0; for (int k = 0; k < N; k++) { sum = sum + a[i][k] * // here you are good, you read memory sequentially b[k][j]; // here, you are not good, every read comes from different cache line } c[i][j] = sum; // here doesn't matter, it is rare operation } } 

Like your case here . This presentation explains how to optimize such code and why it works this way. Hope you find everything you need.

image

0
source

Source: https://habr.com/ru/post/1261592/


All Articles