I'm looking into ways to speed up matrix multiplication (C = AB) using caches. Currently, I'm coding in c, so the natural way that the matrices are stored is row-major. I generally understand why the tiling speeds up the code, because you are able to store an entire block of the B matrix and an entire row of the C matrix into a single cache, and then these values are reused over and over again. But, I'm confused on how exactly all of the memory gets stored into the cache.
I'm just starting to learn about all of this stuff, so I apologize if this question is not clear, but I want to understand exactly how the caches are loaded. For the purposes of this, I'm just going to assume that the cache can hold, say, 10 integers. When you ask for, say, the first index value of an array (A, say), would it then
1) load up A, A, ..., A into the cache automatically?
2) Or, does it load A, then when you ask for A, load A into the next spot, etc.
The motivation behind this question is that I'm interested in how exactly you want to store the matrices in memory. I was under the impression that 1) is how the caches work. But, in order to utilize the tiling algorithm, do you need to somehow store the B matrix in a form that groups all of the values that are in the same block? It seems like that would be the only way to really get an advantage in speed out of the code, if 1) is correct. If 2) is correct, then I think the tiling does offer a big advantage without worrying about how the matrices are stored.
Of course, I could just be totally off-base and neither of the above 1) and 2) is correct, if so, I would really appreciate a good explanation of how it actually does work.