## Effect of cache misses on time of matrix multiplication

Question

I am trying to do matrix multiplication, starting from a small matrix size and increasing it gradually, in hope to observe how time abruptly changes as soon as matrix does not fit in the cache anymore. But to my disappointment, I always get a very smooth graph following apparently the same function. I have tried to start with the matrix size as small as `4x4` and increase it gradually till `3400x3400` which equals `11MB` integer values but I still see no change in the time function. May be I am missing some key point here. Any help would be greatly appreciated.

Here is my C++ code:

``````long long clock_time() {
struct timespec tp;
clock_gettime(CLOCK_REALTIME, &tp);
return (long long)(tp.tv_nsec + (long long)tp.tv_sec * 1000000000ll);
}

int main()
{
for(int matrix_size = 100; matrix_size < 3500; matrix_size += 100)
{
int *A = new int[matrix_size*matrix_size];
int *B = new int[matrix_size*matrix_size];
int *C = new int[matrix_size*matrix_size];

long long start = clock_time();

for(int i = 0; i < matrix_size; ++i)
for(int j = 0; j < matrix_size; ++j)
for(int k = 0; k < matrix_size; ++k)
{
C[i + j*matrix_size] = A[i + k*matrix_size] * B[k + j*matrix_size];
}

long long end = clock_time();
long long totalTime = (end - start);

std::cout << matrix_size << "," << totalTime << std::endl;

delete[] A;
delete[] B;
delete[] C;
}

std::cout << "done" ;

return 0;
}
``````

Here is a sample graph of the data I am getting:

Update: After suggestions from Zheyuan and Frank, I am not initializing my matrices with values `i+j` and dividing time by `2*N^3`

``````for(int i = 0; i < matrix_size; i++)
{
for(int j = 0; j < matrix_size; j++)
{
A[i + j * matrix_size] = i+j;
B[i + j * matrix_size] = i+j;
B[i + j * matrix_size] = i+j;
}

}
``````

Here is the result:

Update 2: After interchanging the `i` and `j` loops:

Show source

## Answers to Effect of cache misses on time of matrix multiplication ( 2 )

1. Well, you will surely observe a cubic curve for timing. Assuming you work with two `N * N` square matrices, then matrix multiplication has complexity, or amount of floating point operations (FLOP) at `2 * N ^ 3)`. As `N` increases, the increase in FLOP dominates the time increase, and you will not observe the latency issue easily.

If you want to investigate the sheer effect of latency, you should "normalize" your timing by amount of FLOP:

``````measured time / (2 * N ^ 3)
``````

Or alternatively:

``````(2 * N ^ 3) / measured time
``````

The former is average time spent per FLOP, while the latter gives you the FLOP per second, often called FLOPs in literature. FLOPs is the major indicator of performance (at least for scientific computation). It is expected, that as `N` increases, the former measure will see upside jump (increased latency), while the latter measure will see downside jump (degraded performance).

I am sorry I don't write C++ so am not able to modify your code (but it is easy, as you just need to do an extra division by `2 * N ^ 3`). I once did the same experiment with C code, and here is my result on Intel Core 2 Duo. Note, I am reporting MFLOPs, or `10 ^ 6` FLOPs. The plot is actually produced in R software.

My above observation really assumes you get everything else correct. But actually, it does not seem so.

Firstly, matrix multiplication is:

``````C[i + j*matrix_size] += A[i + k*matrix_size] * B[k + j*matrix_size];
``````

Note the `+=` not the `=`.

Secondly, your loop nest is somehow badly designed. You are doing matrix multiplication `C = A * B`, where all matrices are stored in column major order, so you should watch out for loop nest order to ensure you always have stride-1 access in the innermost loop. It is well known that `j-k-i` loop nest is optimal in this situation. Thus, consider the following:

``````for(int j = 0; j < matrix_size; ++j)
for(int k = 0; k < matrix_size; ++k)
for(int i = 0; i < matrix_size; ++i)
{
C[i + j*matrix_size] += A[i + k*matrix_size] * B[k + j*matrix_size];
}
``````

Thirdly, you start from matrix size `100 * 100`, that is already outside L1 cache, which is mostly 64KB. I suggest you start from `N = 24`. Some literature shows that `N = 60` is the roughly the boundary value for such cache.

Fourthly, you need repeat multiplication several times to eliminate measurement bias. At the moment, for each trial `N` (or `matrix_size` as in your code), you do the multiplication once and measure the time. This is not accurate. For small `N` you get fake timing. How about repeating `(1000 / N + 1) ^ 3` times?

• when `N` is very small, you repeat great many times;
• as `N` increasingly approaches `1000`, you repeat fewer times;
• when `N > 1000`, you essentially do the multiplication once.

Of course, don't forget that you need to divide measured time by repeated times.

There are of course other places where you can optimize you code, like using constant register and eliminating integer multiplication in address computation, but they are less important hence not covered. Initialization of arrays is also skipped, as it was already raised in Frank's answer.

2. You are not initializing the data inside of your arrays, so your system probably allocated a single page of copy-on-write memory and mapped all of the arrays to that.

In short, A and B are always taking a total of 4096 bytes of hardware memory. And since caching is done based on hardware addresses (as opposed to virtual), you are effectively always in cache.

Initializing A and B with random data will force allocation of actual hardware memory like you desire.