## Matrix multiplication speed issues

Question

I am investigating how cache misses influence speed of computation. I know there are many algorithms better for multiplying two matrices (even simple exchange of two of the loops below would help), but please consider this code:

``````float a[N][N];
float b[N][N];
float c[N][N];
// ...
{
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
float sum = 0.0;
for (int k = 0; k < N; k++) {
sum = sum + a[i][k] * b[k][j];
}
c[i][j] = sum;
}
}
}
``````

I have recompiled this code for many values of `N` and measured time to run it. I expected to find a sudden increase of time at around `N=1250`, at which point matrix `c` no longer fits in cache (size of `c` is then `1250*1250*sizeof(float)=6250000`, or roughly 6MB, which is the size of my L3 cache).

Indeed, the general trend is that after that point, the average time roughly triples compared to extrapolated time from before. But the value of `N%8` seems to have a huge influence on the result. For example:

``````1601 - 11.237548
1602 - 7.679103
1603 - 12.216982
1604 - 6.283644
1605 - 11.360517
1606 - 7.486021
1607 - 11.292025
1608 - 5.794537
1609 - 11.469469
1610 - 7.581660
1611 - 11.367203
1612 - 6.126014
1613 - 11.730543
1614 - 7.632121
1615 - 11.773091
1616 - 5.778463
1617 - 11.556687
1618 - 7.682941
1619 - 11.576068
1620 - 6.273122
1621 - 11.635411
1622 - 7.804220
1623 - 12.053517
1624 - 6.008985
``````

For some time, I thought those might be alignment issues - rows of any matrix are aligned to 32 bytes when `N%8==0` (first question - why 32 bytes in particular? SSE instructions, such as `movaps` can work on 16B aligned data).

Another idea was that this could be somehow connected to cache associativity (8-way for L1 and L2 and 12-way for L3 on my machine).

But then I noticed that for some values of `N`, such as `1536`, there are unexpected spikes (even though the alignment should be excellent in those cases - `1536==256*6`, the associativity being non-issue too - `1536==128*12==192*8`). For example:

``````1504 - 4.644781
1512 - 4.794254
1520 - 4.768555
1528 - 4.884714
1536 - 7.949040
1544 - 5.162613
1552 - 5.083331
1560 - 5.388706
``````

The timing is pretty consistent, so spikes of processor load are not a problem. I compile the code with optimizations turned on (`-O2`). My ideas are unfortunately running out. What could be a reason of such behaviour?

Show source