## Matrix multiplication speed issues

I am investigating how cache misses influence speed of computation. I know there are many algorithms better for multiplying two matrices (even simple exchange of two of the loops below would help), but please consider this code:

```
float a[N][N];
float b[N][N];
float c[N][N];
// ...
{
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
float sum = 0.0;
for (int k = 0; k < N; k++) {
sum = sum + a[i][k] * b[k][j];
}
c[i][j] = sum;
}
}
}
```

I have recompiled this code for many values of `N`

and measured time to run it. I expected to find a sudden increase of time at around `N=1250`

, at which point matrix `c`

no longer fits in cache (size of `c`

is then `1250*1250*sizeof(float)=6250000`

, or roughly 6MB, which is the size of my L3 cache).

Indeed, the general trend is that after that point, the *average* time roughly triples compared to extrapolated time from before. But the value of `N%8`

seems to have a huge influence on the result. For example:

```
1601 - 11.237548
1602 - 7.679103
1603 - 12.216982
1604 - 6.283644
1605 - 11.360517
1606 - 7.486021
1607 - 11.292025
1608 - 5.794537
1609 - 11.469469
1610 - 7.581660
1611 - 11.367203
1612 - 6.126014
1613 - 11.730543
1614 - 7.632121
1615 - 11.773091
1616 - 5.778463
1617 - 11.556687
1618 - 7.682941
1619 - 11.576068
1620 - 6.273122
1621 - 11.635411
1622 - 7.804220
1623 - 12.053517
1624 - 6.008985
```

For some time, I thought those might be alignment issues - rows of any matrix are aligned to 32 bytes when `N%8==0`

(first question - why 32 bytes in particular? SSE instructions, such as `movaps`

can work on 16B aligned data).

Another idea was that this could be somehow connected to cache associativity (8-way for L1 and L2 and 12-way for L3 on my machine).

But then I noticed that for some values of `N`

, such as `1536`

, there are unexpected spikes (even though the alignment should be excellent in those cases - `1536==256*6`

, the associativity being non-issue too - `1536==128*12==192*8`

). For example:

```
1504 - 4.644781
1512 - 4.794254
1520 - 4.768555
1528 - 4.884714
1536 - 7.949040
1544 - 5.162613
1552 - 5.083331
1560 - 5.388706
```

The timing is pretty consistent, so spikes of processor load are not a problem. I compile the code with optimizations turned on (`-O2`

). My ideas are unfortunately running out. What could be a reason of such behaviour?

Show source

## Answers ( 0 )