Basic Cache Optimization – Performance Comparisons

Benchmarking Software

I have written a matrix-matrix multiplication testbed. The source code is available on GitHub. It explores far more than what was covered in this series, including the impact of accumulators, indexers, parallelism, underlying data structures, and so forth. The system does manage maximum theoretical, non-SIMD CPU performance. The code is in C# but easily converts to C++ and the concepts are applicable to most languages. If there is enough demand, a C++ version of the test bed will be released. The solution loads and compiles under Xamarin Studio and runs under Mono.

Test System

The test system is a 3.4 GHz Core i7-2600K “Sandy Bridge” CPU with 16 GB of RAM. Turbo Boost was disabled to allow a more accurate estimate of the potential performance of the system. Hyper-Threading was also disabled. Only the operations required for multiplication were measured. Creating and filling test matrices with random data was not included in the performance figures. For methods that required a matrix transposition, the transposition time was included. All matrix elements are double-precision (64-bit) floating point values.

Performance Figures

The performance figures compare a straightforward matrix-matrix multiplication routine to a transposition routine and a block multiplication routine. The latter two correspond to the first and third optimizations. All routines use accumulators.

I have not written a version of the second optimization for reasons beyond the scope of this article. The first two routines use an array of arrays to hold the matrix elements. The third uses an array of arrays of 1024 element one-dimensional arrays. This is the equivalent of a 32 x 32 block size.

Performance figures shown are a percentage of the test system’s theoretical maximum possible MFLOPS without SIMD. The theoretical maximum is approximately 3400 MFLOPS for single-core routines and 13600 MFLOPS for four-core. Both values were treated as exact for calculation purposes.

The MFLOPS of each operation was calculated using the average time from 100 runs for 1000 x 1000 matrices, 20 runs for 5000 x 5000, and 5 runs for 10000 x 10000 and 15000 x 15000. The number of floating point operations was calculated using the minimum possible for a square matrix of size N x N: 2 * N^3 – N^2. This slightly understates the actual number of operations performed, thus understating the actual MFLOPS.

1000 x 1000



Given the performance improvement, the complexity of the block algorithm compared to the transpose algorithm may not seem to justify the time investment. However, once larger matrices are multiplied, the benefits become apparent. 



For comparison purposes, here is Math.Net‘s Intel MKL-based matrix-matrix multiplication vs the block algorithm. This chart shows MFLOPS, not percentage of maximum performance. Both algorithms use all four cores. The theoretical maximum of the test machine using SIMD on four cores is 109 GFLOPS.



Basic Cache Optimization – Part 4


The first three parts of this series walked through optimizing dot products and matrix-matrix multiplication for the cache. The general guidelines for cache optimization are partitioning, proximity, load speed, and load frequency.


The calculation must be divisible into subcalculations. Both the dot product and matrix-matrix multiplication calculations can be divided. Dot product can be done in groups and matrix-matrix multiplication can be done by multiplying submatrices.

The data must be divisible into blocks. The blocks must be sized such that as many as needed fit into cache simultaneously. Both examples were readily divisible. The dot products were divided into three element blocks. The matrices were divided into 3×3 submatrices. Two three element dot product blocks fit into cache at once, as did two 3×3 submatrices.


Partitioning the data into blocks isn’t enough. When partitioned, the elements necessary for the divided calculation must all lie within the same block. Both examples had good proximity. The three element dot product blocks contained all the elements needed for that part of the dot product. The 3×3 submatrices for matrix-matrix multiplication contained all of the elements necessary for their multiplication.

Load Speed

Moving the data from the blocks into cache must be fast. This can be done in two ways. The data can be stored in RAM in structures designed so the memory manager automatically loads a block into cache. This requires using specialized data structures and may cause problems when moving data from one system to another.

The alternative is to copy a block’s elements from RAM into a special data structure the memory manager automatically loads into cache. Writing a high speed data copy routine is not trivial. Many languages do not have the features necessary to write one. Writing the routine in one language and calling it from another involves overhead. This overhead may slow the call down more than writing it in another language sped it up.

The dot product example had great load speed. The elements were in one-dimensional arrays. The memory manager automatically loaded the elements into cache in three element blocks.

The matrix-matrix multiplication calculation, as written, has horrible load speed. While the code may think of the data as being in submatrices, the elements in RAM are still in a giant, row-major, one-dimensional array. The data structure must be modified to store the elements in actual submatrices or special code will be needed to copy the submatrix elements from RAM into cache. Please see Implementing a High Performance Matrix-Matrix Block Multiply for one way of storing the matrix elements in submatrices.

Load Frequency

The final piece is ensuring the blocks as a group are loaded into cache as few times as possible. This requires looking at the big picture. Minimizing loads for a part of the calculation may prevent the calculation as a whole from being as fast as possible.

The dot product calculation automatically had the lowest possible load frequency. Each three element block was only needed once and was only loaded into cache once. Unfortunately, knowing how to optimize one part of a calculation may result in suboptimal performance for the calculation as a whole.

In the first matrix-matrix multiply, using the optimized dot product routine resulted in inefficient cache use. Each block of a row of A was needed multiple times and was loaded multiple times. The second optimization changed the order in which matrix-matrix multiplication was calculated, which resulted in each block of A being loaded only once. The third optimization also did this.


The only way to approach the maximum calculation potential of a CPU is by optimizing the calculation for the cache. The high clock speeds of modern CPUs have made RAM access prohibitively expensive. Even non-SIMD calculations written without cache optimization run more than 10x slower than a cache-optimized version.

Running an inefficient calculation on many CPUs is often considered a better solution than cache optimization. While this may raise throughput, latency remains an issue and increases the cost of mistakes. Unknowingly tainting a 10 minute calculation costs 10 minutes. Unknowingly tainting just one of ten 100 minute calculations costs 100 minutes.

SIMD optimization is predicated on cache optimization. The pinnacle of CPU performance is out of reach for those who do not optimize for the cache. The difference in performance between a SIMD-optimized calculation and one not optimized for the cache can be a factor of 1000. This is the difference between taking over eight hours and taking five minutes. 


Following the guidelines above may result in massive speed improvements for a code. If a programmer plans on using SIMD, cache optimization is the first step. However, cache optimization has some perils.

Before doing any cache optimization, a comprehensive test system must be available. If a tested, known-good reference routine is not available, start with the simplest implementation of the calculation. Once that is well-tested, use it as a validator. Cache optimization often involves data structures that create fencepost and/or symmetry problems. Tests must cross over partition boundaries, use non-integer multiples of block sizes, and so forth.

Partitioning schemes must take into account the nature of the data. Does the data naturally partition itself into the chosen block size? Will it continue to do so? Does partitioning happen invisibly to the end-programmer? Or do they have to code for it? All of these questions potentially limit the portability and lifespan of the code.

Cache-optimized calculations will be mathematically identical to the original calculation. But they may not have an identical number of operations. When working with floating point values, this may result in small differences between the two calculations. Additionally, care must be taken to avoid significant digit problems. If the code must match a reference calculation exactly, cache optimizations may be limited or impossible.

Intentionally picking the order of operations in a calculation allows the best optimization the programmer is capable of envisioning. Unfortunately, this increases development time and limits the optimization to what the programmer can conceptualize. Whether this level of optimization is better or worse than what the compiler would produce depends on the compiler and the programmer. Additionally, some compilers may produce better output than a programmer only when “tricked”. Learning how to “trick” the compiler takes time and these tricks may change between minor releases of the compiler. Whether it is better to optimize the code by hand or play compiler games depends on the programmer, the compiler, and the situation.

Different CPUs use different cache sizes. As of 2013, almost all x86-compatible laptop, desktop, and server CPUs use 32 KB L1 data caches. Many tablet and phone CPUs use 16 KB data caches. Some older ones use 8 KB data caches. Whether this is an issue depends on the purpose of the code.

If data partitioning is done via a custom data structure, custom support routines will be needed. Using generic support routines via indexers or mapping objects may leave the system with one amazingly fast calculation and many more extremely slow ones. Writing and testing custom support routines and converters is non-trivial.

This series deals with optimizing for the L1 data cache. When writing code destined for SIMD optimization, this may not be enough. Optimizing beyond the L1 cache is beyond the scope of this article. To get an idea of what is involved, please follow the links from Kazushige Goto’s Wikipedia entry.

Please continue on to the performance comparisons.

Basic Cache Optimization – Part 3

This is part three in a series on basic cache optimization. Please start with part 1.

The second optimization improved the speed of the overall calculation. But, like the first optimization, it required transposing the second matrix. Transposition requires time and memory. However, dividing the rows into blocks was a good start.

Third Optimization

In order to explain the third optimization, we need a cache more like one used in a real CPU. The cache now holds eighteen elements. The memory manager now pulls in nine elements at a time. The example matrices are the same. On paper:


and in RAM:


In the second optimization, each row was divided into two halves. Each half was used to calculate a partial dot product. As the calculation continued, the partial dot products were summed. Once all of the partial dot products were calculated and summed, C = A * B was complete.

Dividing the rows into halves divided, or partitioned, them into blocks. In this case, each block was one row high by three columns wide. While BT wasn’t explicitly partitioned by the code, the memory manager effectively partitioned BT by loading three elements of each row at a time.

A larger cache can hold more elements. The 18 element cache could be used to hold an entire row of A and BT. But that still requires transposing B to create BT. It would be better if B didn’t need to be transposed in order to use the cache effectively. The two matrices are once again A and B, both on paper:


and in RAM:


The second optimization divided each row into three element blocks. This was possible because a dot product can be calculated in parts. However, a row divided into blocks is also a matrix. A three element block is a 1 x 3 matrix.

This means A and B can be divided into sizes other than 1 x 3. The formula for multiplying a matrix that has been divided into blocks is the same as the normal matrix-matrix multiplication formula. The only difference is that the elements of A, B, and C are now matrices. Partitioning A and B into 3×3 submatrices:


turns them into a 3 x 2 matrix and a 2 x 2 matrix:


Since multiplying two matrices that are composed of submatrices follows the normal multiplication rules, A * B = C looks like:


Following the rules, C[0,0] = A[0,0] * B[0,0] + A[0,1] * B[1,0]. Matrix multiplication has already been defined, so calculating A[0,0] * B[0,0] and A[0,1] * B[1,0] is simple. Adding two matrices requires adding corresponding elements, which is also trivial.

One advantage of this is the submatrix size can be tailored to the cache size. Two 3 x 3 matrices will fit into an eighteen element cache. This makes multiplying two 3 x 3 matrices very fast, much like dividing rows into three element pieces made calculating dot products very fast. 

Another advantage is the order of operations can be tailored to use the cache efficiently. Just like the second optimization’s dot products, the submatrix multiplications can be partially calculated. Instead of loading A[0,0] and B[0,0] into cache and multiplying them, then loading A[0,1] and B[0,1] and multiplying them, the code can load A[0,0] into cache and keep it there. It multiplies A[0,0] by B[0,0] and stores a temporary result. Then it multiplies A[0,0] by B[0,1] and stores a temporary result. Only then does it move on to A[0,1]. This minimizes how many times each submatrix is loaded into cache.

The final advantage is this simplifies writing a fast matrix-matrix multiplication routine. Instead of trying to come up with code that is fast for any size matrix, time can be spent coming up with an insanely fast routine for multiplying 3 x 3 matrices. Or 10 x 10. Or whatever size fits the cache best. Once this insanely fast routine is complete, any pair of matrices can be partitioned into blocks of this optimal size.

Unfortunately, getting the data in a block from RAM into cache is not trivial. If the data is stored in a standard row-major or column-major array, special code is needed to copy the data from the main matrix array in RAM to specialized submatrix arrays that can be loaded into cache. If this code is slow, block multiplication will be slower than transposing the second matrix. Fast array copy code often requires programming language features unavailable in most recent programming languages.

The other way of handling this is to store the data in submatrix-major order. This has its own drawbacks that are covered in the conclusion. For an implementation that works in most languages, please see Implementing a High Performance Matrix-Matrix Block Multiply.

Conclusion and concerns are in part four.

Basic Cache Optimization – Part 2

This is part two in a series on basic cache optimization. Please start with part 1.

The first optimization improved the dot product calculation. But the order in which the dot products were calculated resulted in each partial row of A being loaded into cache multiple times. The overall matrix multiplication calculation would run faster if each partial row of A was used as many times as possible before overwriting it in cache.

Second Optimization

A and B are the same as in part one. The cache, CA, still holds six elements. The memory manager retrieves three elements from RAM at a time. As in the first optimization, the calculation has been modified to use transposed B, BT. On paper, the matrices are:


In RAM, they are:


When multiplying two matrices, C = A * B, each element of C, C[i, j], is the dot product of the i-th row of A and the j-th row of BT.

There is no rule that a dot product must be calculated all at once. The formula could break each dot product into sections. C[i, j] is the dot product of the first three element pairs of the i-th row of A and the j-th row of BT plus the dot product of the second three element pairs of the i-th row of A and the j-th row of BT.

In terms of reading data from RAM, the first optimization did this. Each read from RAM brought in three useful elements. Making the calculation explicitly work with three element pairs at a time lets the programmer work better with the cache. Working with three element pieces, or blocks, of each matrix row lets the programmer control, to an extent, what blocks are kept in cache and for how long.

If the code calculates the dot products of all the first three element pairs, then all of the second three element pairs, the number of RAM accesses is reduced. When C[0,0] is calculated, the first three elements of the first rows of A and BT are loaded into cache and their dot product is calculated. This is the same as in the first optimization:

NewImage NewImage

Instead of continuing on to the next three elements of the first rows of A and BT, the code saves this intermediate dot product. The code then calculates the dot product of the first three elements of the first row of A and the first three elements of the second row of BT.


The cache is now:


This process is repeated using the first three elements of the remaining rows of BT. The code is now finished with the first three elements of the first row of A. They were loaded into cache once, used, and can be overwritten.

The code then calculates intermediate dot products for the second three elements of the first row of A and the second three elements of each row of BT. These intermediate dot products are added to the ones stored earlier. The result of this set of calculations is the entire first row of C.

This process is repeated for the remaining rows of A. Once these are complete, all of C has been filled in. This change in calculation order yields a large improvement in the number of RAM accesses.

For the example multiplication, A has nine rows and six columns. B has six rows and six columns. C has nine rows and six columns. The cache, CA, holds six elements and the memory manager loads three elements into cache at at time. Each of the methods requires the following number of RAM accesses.

Method          RAM Accesses
Basic           8 * 54 = 432
First Opt.      4 * 54 = 216
Second Opt.     14 * 9 = 126

The first and second optimizations require transposing B. This takes time and uses memory. There is a better way of optimizing matrix multiplication for the cache.

To be continued in part three.