I wrote a matrix-matrix (32-bit float) multiplication function in C++ using intrinsics for large matrices (8192x8192); the minimum data size for every read and write operation is 32 B.
I am going to change the algorithm to a blocked one that reads an 8x8 block into 8 YMM registers, does the multiplications against the target block's rows (with another YMM register as the target), and finally accumulates the 8 results in yet another register before storing to memory.
Question: does it matter that it fetches 32 B chunks from non-contiguous addresses? Is there a drastic performance difference between reading like this:

read 32 B from p, compute, read 32 B from p+8192 (the next row of the block), compute, and so on until all 8 rows are done, then write 32 B to the target matrix row p3

and reading contiguously like this:

read 32 B from p, compute, read 32 B from p+32, compute, read 32 B from p+64, ...

I mean the read speed of main memory, not the cache.
Note: I'm using an FX-8150, and I don't know whether it can read more than 32 B in a single operation.
Having one contiguous buffer will probably give you better performance (at the very least, it's not worse!).
How big the performance difference is will depend on a large number of factors. Of course, if you allocate a bunch of 32-byte blocks, you are quite likely to get "close-together" lumps of memory, so the caching benefit will still be there. The worst case is every block landing in a different 4 KB region of memory; a few bytes of "empty space" between blocks is not that big a deal.
Like so many other performance questions, it comes down to the exact details of what the code does, the memory type, the processor, and so on. The only way to REALLY find out is to benchmark the different options...