Basic Cache Optimization – Part 1

Decades of refinement have resulted in CPUs optimized for processing arrays of numerical data. A cache, then multiple caches, were introduced to ameliorate the growing disparity between CPU and RAM speeds. To achieve the maximum possible CPU performance for a given calculation, the data necessary for the calculation must be loaded into the cache as quickly as possible and as few times as possible.

If all of the data required for a calculation fits into cache simultaneously, no cache-specific optimization is necessary. When the data required for a calculation is too large to fit into the cache, it must be processed in blocks. These blocks should contain all necessary data to complete a subset of the calculation. If possible, they should also contain the data necessary to complete the next subset of the calculation. Failing to do this efficiently will result in unnecessary RAM accesses and longer execution times.

Modern CPUs have multiple cores and multiple caches. Commonly, each core has its own cache (the L1 cache) and shares one or more other caches (L2, L3, perhaps others) with the remaining cores. This article uses vector dot product and matrix-matrix multiplication to demonstrate optimizing for the L1 cache. Depending on the data characteristics and potential processing speed of a particular calculation, optimizing for other caches may be useful.

Basic Example: Vector Dot Product

Some calculations may be inherently cache optimized. Vector dot product is one. Take two nine element vectors, X and Y, stored as one-dimensional arrays. The CPU has a six element cache, CA. To compute X • Y, corresponding elements of X and Y are multiplied together and the resulting products are summed. System memory initially looks like:


The dot product code starts with the first elements of X and Y, X[0] and Y[0]. When the code requests these from RAM, the memory manager fetches X[0] and the values stored in the next several sequential memory locations. These values are copied into cache. It then performs the same process for Y[0].

Once the data has been retrieved, the code stores 1 * 11 in a temporary variable. (Writing to cache will be covered in a separate article.) If the memory manager fetches three elements at a time, system memory now looks like:


The code moves on to the second pair of elements. When it requests X[1] and Y[1], the memory manager retrieves each from cache. This replaces two slow RAM accesses with two fast cache accesses. The third element pair, X[2] and Y[2], is also retrieved from cache.

When the code requests the fourth element pair, the memory manager has to read them from RAM. It performs one read for X’s elements 4, 5, and 6 and another read for Y’s elements 4, 5, and 6. These six values are copied into the cache. System memory now looks like:


Element pairs 4, 5, and 6 can be multiplied without accessing RAM. Element pairs 7, 8, and 9 require two more RAM accesses. This entire dot product calculation requires six reads from RAM: two reads each for the first, fourth, and seventh element pairs. These six reads load the remaining element pairs as a side effect.

This calculation is inherently cache-optimized. The vectors are stored as one-dimensional arrays. The array elements are sequential in RAM. The array elements are accessed in sequence. The memory manager’s default actions are optimal for this calculation.

A programmer could short-circuit this inherent optimization. If X • Y were computed as X[0] * Y[0] + X[3] * Y[3] + X[6] * Y[6] and so forth, the calculation would take much longer than necessary. Every element pair would require two reads from RAM. The calculation would go from six RAM reads to eighteen.
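The read counts above can be checked with a toy model. The following C sketch is only a simulation under stated assumptions, not a model of real hardware: it assumes a six-element cache split into two three-element lines, first-in-first-out replacement, and X and Y stored back to back starting at address 0. The names TinyCache, cache_touch, and count_ram_reads are hypothetical.

```c
#include <assert.h>

#define LINE_ELEMS 3   /* elements copied into cache per RAM read (assumed) */
#define CACHE_LINES 2  /* six-element cache = two three-element lines (assumed) */
#define VECTOR_LEN 9

typedef struct {
    long tags[CACHE_LINES]; /* line number held in each slot, -1 if empty */
    int next_slot;          /* slot to overwrite next (first in, first out) */
    int ram_reads;          /* number of cache misses so far */
} TinyCache;

/* Request one element by its flat RAM address; a miss costs one RAM read. */
static void cache_touch(TinyCache *cache, long address) {
    long line = address / LINE_ELEMS;
    for (int slot = 0; slot < CACHE_LINES; ++slot) {
        if (cache->tags[slot] == line) {
            return; /* hit: the element is already in cache */
        }
    }
    cache->tags[cache->next_slot] = line;
    cache->next_slot = (cache->next_slot + 1) % CACHE_LINES;
    cache->ram_reads++;
}

/* Count RAM reads for a dot product that visits element pairs in `order`. */
static int count_ram_reads(const int *order, int count) {
    TinyCache cache = { { -1, -1 }, 0, 0 };
    const long x_base = 0;          /* X occupies addresses 0..8 */
    const long y_base = VECTOR_LEN; /* Y occupies addresses 9..17 */
    assert(count > 0);
    for (int i = 0; i < count; ++i) {
        cache_touch(&cache, x_base + order[i]);
        cache_touch(&cache, y_base + order[i]);
    }
    return cache.ram_reads;
}
```

Under these assumptions, visiting the pairs in the order 0, 1, ..., 8 costs six RAM reads, while the order 0, 3, 6, 1, 4, 7, 2, 5, 8 costs eighteen, matching the counts in the text.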

Unfortunately, many data structures do not work well with how the cache is filled.

Complex Example: Matrix-Matrix Multiplication

While programmers and researchers often use n-dimensional data structures, RAM is a one-dimensional array. Mapping a two-, three-, or more-dimensional array into one-dimensional RAM often leads to inefficient cache use. Matrix-matrix multiplication is a perfect example. If needed, please see A Quick Introduction to Matrix-Matrix Multiplication for a brief refresher.

Basic Multiplication

Matrices are generally thought of as two-dimensional arrays. But what appears on paper as:


looks like this in RAM:


The first element of A * B is the dot product of A’s first row and B’s first column. On paper, this is simple.


In RAM, it is not.


When the code begins calculating the dot product, it requests A[0,0] and B[0,0]. Using the same six element cache and memory manager that loads three sequential elements at a time, the cache looks like this:


The first three elements of A’s first row are now in cache. The first three elements of B’s first row are also in cache. Unfortunately, we need the first three elements of B’s first column. Only the first element of B’s first row is useful.

Since one row of A and one column of B cannot fit into cache at the same time, the best scenario for a single dot product calculation is four reads from RAM: two reads for the row of A and two reads for the column of B. But, as shown, the data structure doesn’t allow this.

With this data structure, each dot product requires eight RAM accesses: one for every three elements of a row of A plus one for every element of a column of B. A * B requires 54 dot products. The multiplication calculation went from the best-case 216 RAM accesses (54 * 4) to 432 RAM accesses (54 * 8). A modern CPU core can do hundreds if not thousands of operations in the time one RAM access takes.
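The eight-access count follows from how a two-dimensional index maps to a flat address. The sketch below assumes row-major storage, six-element rows and columns, a matrix that starts on a three-element boundary, and a B with six columns (a hypothetical shape, since the figures are not reproduced here). It counts only distinct fetches and ignores eviction. The function names are illustrative.

```c
#include <assert.h>

/* Flat RAM offset of element [row, column] in a row-major matrix. */
static long row_major_offset(long row, long column, long column_count) {
    return row * column_count + column;
}

/* Number of fetches needed to read `count` elements that start at `start`
   and sit `stride` elements apart, when each fetch brings in `line_elems`
   sequential elements. Assumes stride >= 1, so addresses only increase. */
static int fetches_needed(long start, long stride, int count, long line_elems) {
    int fetches = 0;
    long previous_line = -1;
    assert(stride >= 1 && line_elems >= 1);
    for (int i = 0; i < count; ++i) {
        long line = (start + i * stride) / line_elems;
        if (line != previous_line) {
            ++fetches;            /* a new line must come from RAM */
            previous_line = line;
        }
    }
    return fetches;
}
```

A row of A is read with stride 1 and needs two fetches. A column of B is read with a stride equal to B's column count, so each of its six elements lands in a different line and needs its own fetch. Two plus six gives the eight accesses per dot product.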

First Optimization

If B’s columns were in rows, they would be in sequential memory. This would result in better cache use because the memory manager would automatically load three useful elements of B into cache per RAM access. Transposing a matrix swaps the rows and columns. BT represents transposed B. On paper, the new matrices look like:


They look like this in RAM:


This change turns the cache for the first portion of the first dot product into:


This reduces the RAM accesses per dot product to four. Since this is the calculated minimum number of accesses per dot product, the data now seems optimized for the cache.
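A minimal sketch of this optimization in C, assuming row-major storage in flat arrays. The function names transpose and product_element are hypothetical, and the sketch only shows the access pattern, not a tuned routine.

```c
#include <assert.h>
#include <stddef.h>

/* Write the transpose of a row-major rows x columns matrix into `out`,
   which must hold columns x rows elements and must not alias `in`. */
static void transpose(const double *in, size_t rows, size_t columns, double *out) {
    assert(in != out);
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < columns; ++c) {
            out[c * rows + r] = in[r * columns + c];
        }
    }
}

/* Element [row, column] of A * B, computed from A and the transpose of B.
   Both operands are read sequentially: row `row` of A and row `column` of BT. */
static double product_element(const double *a, const double *bt,
                              size_t inner, size_t row, size_t column) {
    const double *a_row = a + row * inner;
    const double *bt_row = bt + column * inner;
    double sum = 0.0;
    for (size_t k = 0; k < inner; ++k) {
        sum += a_row[k] * bt_row[k];
    }
    return sum;
}
```

The transpose itself costs one pass over B, so it pays off only when the transposed copy is reused across many dot products, as it is in a full matrix multiplication.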

But it isn’t. Cache usage for calculating a single dot product is optimized. But the calculation multiplies two matrices. The second element of A * B is the dot product of the first row of A and the second column of B. Using BT, the calculation is the dot product of the first row of A and the second row of BT.


This will take an additional four RAM accesses. Two of these will be spent loading the first row of A into the cache for the second time. Optimizing the low-level calculation, in this case dot product, improved performance. However, optimizing the calculation as a whole will cut the number of RAM accesses almost in half for this example.

To be continued in part two.

C#, Generics, and Optimizers

Generics are a handy feature of many languages. C# has had them since version 2.0. In some circumstances, their performance impact is minimal. In others, using them can cause significant slowdowns. Here is an example of the latter.

My matrix interface requires an indexer. Indexers can adversely affect performance, so my matrix classes have indexer versions of each of the multiplication methods for benchmarking purposes. This was done even though the code is shared, as I suspected there would be a performance price for using generics.

Having determined the performance cost of indexers, I decided to convert the methods to generics. I thought the additional cost of generics would be fairly low and it would be handy to easily multiply different matrix classes. The methods went from this format:

public MatrixAA Multiply(MatrixAA multiplier) { .... }

to this:

public static U Multiply<U, V>(U matrixA, V matrixB)
    where U:IMatrix<U>, new()
    where V:IMatrix<V> { .... }

The interface itself uses generics as the required methods must return their own type. The new() constraint is required for U so the product matrix can be created.

The performance hit was incredible. The above generic is 82% slower than the non-generic. The parallel version is 65% slower. More sophisticated multiplication methods show similar degradation.

Ildasm’s disassemblies of the generic and non-generic versions of the method imply the issue is either the use of an interface or the use of a generic interface. The non-generic version compiles to late-bound calls to MatrixAA.get_Item(). The generic version compiles to late-bound calls to IMatrix<U>.get_Item() and IMatrix<V>.get_Item().

The speed difference could be the overhead of getting from IMatrix<U>.get_Item() to MatrixAA.get_Item() at run-time. Alternatively, the JIT optimizer may have trouble optimizing the generic version of the call. Either way, the performance loss is too great for me to use the generic versions. The indexer methods need to show the cost of using an indexer, not the cost of using generics in this situation.

Implementing a High Performance Matrix-Matrix Block Multiply

A basic block matrix-matrix multiply code divides the operands into logical sub-matrices and multiplies and adds the sub-matrices to calculate the product. For A * B = C and a block size of B:

// blockSize is the block size the text calls B, renamed so it cannot be confused with matrix B. C must start zero-filled.
for (int blockRow = 0; blockRow < A.Rows; blockRow += blockSize) {
    for (int blockColumn = 0; blockColumn < B.Columns; blockColumn += blockSize) {
        for (int blockK = 0; blockK < A.Columns; blockK += blockSize) {
            for (int row = blockRow; row < MIN(blockRow + blockSize, A.Rows); ++row) {
                for (int column = blockColumn; column < MIN(blockColumn + blockSize, B.Columns); ++column) {
                    for (int k = blockK; k < MIN(blockK + blockSize, A.Columns); ++k) {
                        C[row, column] += A[row, k] * B[k, column];
                    }
                }
            }
        }
    }
}

While this code is easy to understand, it relies on the compiler to treat the matrices as blocks. Many if not most compilers do this poorly. Even with significant optimization, this code cannot overcome the limitations of most commonly used matrix data structures.
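For readers who want to compile the basic version, here is a sketch of the same loop nest in C over flat row-major arrays. The function name blocked_multiply, its parameter names, and the flat-array layout are assumptions made for illustration, not the author's code.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for row-major A (m x n), B (n x p), and C (m x p), visiting
   the operands in block_size x block_size tiles. The caller zero-fills C
   when a plain product is wanted. */
static void blocked_multiply(const double *a, const double *b, double *c,
                             size_t m, size_t n, size_t p, size_t block_size) {
    assert(block_size > 0);
    for (size_t br = 0; br < m; br += block_size) {
        for (size_t bc = 0; bc < p; bc += block_size) {
            for (size_t bk = 0; bk < n; bk += block_size) {
                /* Clamp each tile so partial tiles at the edges work. */
                size_t row_end = min_size(br + block_size, m);
                size_t col_end = min_size(bc + block_size, p);
                size_t k_end = min_size(bk + block_size, n);
                for (size_t row = br; row < row_end; ++row) {
                    for (size_t col = bc; col < col_end; ++col) {
                        double sum = c[row * p + col];
                        for (size_t k = bk; k < k_end; ++k) {
                            sum += a[row * n + k] * b[k * p + col];
                        }
                        c[row * p + col] = sum;
                    }
                }
            }
        }
    }
}
```

The tiles here are only a loop-ordering device. The data is still stored row by row, so each tile is scattered across memory.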

In order to implement a high performance matrix-matrix block multiply, the data structure and implementation code must be carefully optimized. In this article, the block size is B, the matrices are A * B = C, and the matrix elements are double-precision (64-bit) floating point numbers.

The Data Structure

Instead of relying on the compiler to optimize the basic block multiply code, the matrix data will be stored directly in blocks. While a matrix M has X logical rows and Y logical columns of double-precision elements, the actual structure will be an array of arrays. Physically, M has CEILING(X / B) block rows and CEILING(Y / B) block columns. Each physical element of M is a one-dimensional array of doubles with B * B elements. While using an array of arrays of an existing matrix class may be tempting, overall performance is unlikely to be adequate.
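One possible shape for this structure in C is sketched below. The type BlockedMatrix and the functions blocked_create, blocked_destroy, and blocked_at are hypothetical names. The sketch assumes each block is stored row-major and that blocks are zero-filled so padding never contributes to a product.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    size_t rows, columns;             /* logical dimensions (X and Y) */
    size_t block_size;                /* B */
    size_t block_rows, block_columns; /* CEILING(X / B) and CEILING(Y / B) */
    double **blocks;                  /* one B * B array per block */
} BlockedMatrix;

/* Allocate a zero-filled blocked matrix. Returns 0 on success, -1 on failure. */
static int blocked_create(BlockedMatrix *m, size_t rows, size_t columns,
                          size_t block_size) {
    assert(rows > 0 && columns > 0 && block_size > 0);
    m->rows = rows;
    m->columns = columns;
    m->block_size = block_size;
    m->block_rows = (rows + block_size - 1) / block_size;
    m->block_columns = (columns + block_size - 1) / block_size;
    size_t count = m->block_rows * m->block_columns;
    m->blocks = malloc(count * sizeof *m->blocks);
    if (m->blocks == NULL) return -1;
    for (size_t i = 0; i < count; ++i) {
        m->blocks[i] = calloc(block_size * block_size, sizeof(double));
        if (m->blocks[i] == NULL) {
            while (i > 0) free(m->blocks[--i]); /* undo partial allocation */
            free(m->blocks);
            m->blocks = NULL;
            return -1;
        }
    }
    return 0;
}

static void blocked_destroy(BlockedMatrix *m) {
    if (m->blocks == NULL) return;
    size_t count = m->block_rows * m->block_columns;
    for (size_t i = 0; i < count; ++i) free(m->blocks[i]);
    free(m->blocks);
    m->blocks = NULL;
}

/* Pointer to logical element [row, column]. */
static double *blocked_at(const BlockedMatrix *m, size_t row, size_t column) {
    size_t b = m->block_size;
    double *block = m->blocks[(row / b) * m->block_columns + column / b];
    return &block[(row % b) * b + column % b];
}
```

An accessor like blocked_at is convenient for loading and checking data. The multiplication code should work on whole blocks instead, because the per-element division and modulo would dominate the inner loop.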

This design will result in a small amount of wasted memory. The worst case, where both the number of rows and the number of columns are integer multiples of B plus one additional element, will waste approximately X * (B - 1) * sizeof(double) + Y * (B - 1) * sizeof(double) bytes. For a 100,000 x 100,000 matrix, this adds ~47 MB (a figure consistent with B = 32) to the matrix’s ~75 GB size.
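The padding can also be computed exactly. The sketch below uses a hypothetical helper, padding_bytes, and assumes 8-byte doubles. The exact count includes the small (B - 1) * (B - 1) corner that the approximate formula leaves out.

```c
#include <assert.h>

/* Bytes of padding added when a rows x columns matrix of doubles is stored
   in block_size x block_size blocks. */
static unsigned long long padding_bytes(unsigned long long rows,
                                        unsigned long long columns,
                                        unsigned long long block_size) {
    assert(block_size > 0);
    /* Round each dimension up to the next multiple of block_size. */
    unsigned long long padded_rows =
        (rows + block_size - 1) / block_size * block_size;
    unsigned long long padded_columns =
        (columns + block_size - 1) / block_size * block_size;
    return (padded_rows * padded_columns - rows * columns) * sizeof(double);
}
```

With an assumed block size of 32, a 100,001 x 100,001 matrix (one element past a multiple of 32 in each dimension) is padded by 49,608,184 bytes, roughly 47 MB. A dimension that is already a multiple of the block size adds no padding.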

The Multiplication Algorithm

Since a matrix of matrices is still a matrix, the standard matrix-matrix multiplication algorithm holds. Each element C[i, j] is the dot product of the i-th row of A and the j-th column of B. The dot product of two vectors of matrices is the sum of the products of the corresponding vector elements, where each product is itself a matrix-matrix multiplication of two blocks.

The Implementation Code

// C99/C++-style code. If using C#, substitute double[] for double*
// blockSize is the block size the text calls B. The blocks of C must start zero-filled.

for (int blockRow = 0; blockRow < A.blockRowCount; ++blockRow) {
    for (int blockColumn = 0; blockColumn < B.blockColumnCount; ++blockColumn) {
        double* subC = C[blockRow, blockColumn];
        for (int blockK = 0; blockK < A.blockColumnCount; ++blockK) {
            double* subA = A[blockRow, blockK];
            double* subB = B[blockK, blockColumn];
            for (int row = 0; row < blockSize; ++row) {
                for (int column = 0; column < blockSize; ++column) {
                    for (int k = 0; k < blockSize; ++k) {
                        // Accumulate into one element of the result block, not into the pointer itself.
                        subC[row * blockSize + column] += subA[row * blockSize + k] * subB[k * blockSize + column];
                    }
                }
            }
        }
    }
}

Average single-threaded performance should be near 98% of theoretical, non-SIMD maximum for matrices over 2000 x 2000. Average multi-threaded performance should be over 95% for matrices over 3000 x 3000. Performance for square matrices should increase as matrix size increases.

If these performance levels are not reached via compiler options, some hand-optimization may be necessary. Once the optimal block size is known, unrolling the k loop will provide a significant speed improvement. Changing the order of the blockRow, blockColumn, and blockK loops may improve performance by a few percentage points.
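As an illustration of unrolling the k loop, the sketch below unrolls it four times. It assumes the block size is a multiple of four and that blocks are stored row-major. The function name block_multiply_unrolled is hypothetical, and the best unroll factor for a given CPU and compiler has to be found by measurement.

```c
#include <assert.h>

/* sub_c += sub_a * sub_b for block_size x block_size row-major blocks, with
   the k loop unrolled four times. Assumes block_size % 4 == 0. */
static void block_multiply_unrolled(double *sub_c, const double *sub_a,
                                    const double *sub_b, int block_size) {
    assert(block_size > 0 && block_size % 4 == 0);
    for (int row = 0; row < block_size; ++row) {
        const double *a_row = sub_a + row * block_size;
        for (int column = 0; column < block_size; ++column) {
            const double *b_col = sub_b + column; /* top of this column */
            double sum = sub_c[row * block_size + column];
            for (int k = 0; k < block_size; k += 4) {
                sum += a_row[k]     * b_col[k * block_size]
                     + a_row[k + 1] * b_col[(k + 1) * block_size]
                     + a_row[k + 2] * b_col[(k + 2) * block_size]
                     + a_row[k + 3] * b_col[(k + 3) * block_size];
            }
            sub_c[row * block_size + column] = sum;
        }
    }
}
```

Unrolling changes the order in which the products are summed, so results can differ from the rolled loop in the last few bits.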

Determining Block Size

Experimentation will determine the optimal block size. Sqrt(L1 Cache Size / (4 * sizeof(double))) is a good place to start.
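The starting point can be computed as follows. The helper name starting_block_size is hypothetical, the result is rounded down, and the L1 size must be supplied by the caller because standard C has no portable way to query it.

```c
#include <assert.h>
#include <stddef.h>

/* Starting block size: floor(sqrt(l1_bytes / (4 * sizeof(double)))).
   Uses an integer square root so the math library is not needed. */
static size_t starting_block_size(size_t l1_bytes) {
    assert(l1_bytes >= 4 * sizeof(double));
    size_t target = l1_bytes / (4 * sizeof(double));
    size_t block = 0;
    while ((block + 1) * (block + 1) <= target) {
        ++block;
    }
    return block;
}
```

For an assumed 32 KiB L1 data cache and 8-byte doubles this gives 32. A 64 KiB cache gives 45, which would normally be rounded to a nearby convenient size before experimenting.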

Parallelization Note

When parallelizing the code, special attention must be paid to the loop design. The loop design must guarantee each result sub-matrix subC is updated by a single thread. If the design itself cannot guarantee this, explicit locking must be used. If this is not done, random errors may result. These errors may only involve a tiny number of elements out of millions or billions. The test system must include comparing every element of the result to the result of a known-good matrix-matrix multiplication routine.
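One loop design that satisfies this requirement is sketched below using OpenMP. OpenMP is a choice made for this example; the text does not name a threading API. The function name, the parameter names, and the flat array of block pointers are also assumptions. Parallelism is applied only to the blockRow and blockColumn loops, so each result block belongs to exactly one iteration.

```c
#include <assert.h>

/* C += A * B over blocked storage. Block [r, c] of a matrix with `cols`
   block columns is blocks[r * cols + c], a block_size * block_size
   row-major array. */
static void blocked_multiply_parallel(double *const *a, double *const *b,
                                      double *const *c,
                                      int a_block_rows, int a_block_cols,
                                      int b_block_cols, int block_size) {
    assert(block_size > 0);
    /* Each (block_row, block_col) pair owns one result block, so no two
       threads ever write the same sub_c and no locking is needed. If
       OpenMP is not enabled, the pragma is ignored and the loops run
       serially. */
    #pragma omp parallel for collapse(2)
    for (int block_row = 0; block_row < a_block_rows; ++block_row) {
        for (int block_col = 0; block_col < b_block_cols; ++block_col) {
            double *sub_c = c[block_row * b_block_cols + block_col];
            for (int block_k = 0; block_k < a_block_cols; ++block_k) {
                const double *sub_a = a[block_row * a_block_cols + block_k];
                const double *sub_b = b[block_k * b_block_cols + block_col];
                for (int row = 0; row < block_size; ++row) {
                    for (int column = 0; column < block_size; ++column) {
                        double sum = 0.0;
                        for (int k = 0; k < block_size; ++k) {
                            sum += sub_a[row * block_size + k]
                                 * sub_b[k * block_size + column];
                        }
                        sub_c[row * block_size + column] += sum;
                    }
                }
            }
        }
    }
}
```

Parallelizing the blockK loop instead would let several threads accumulate into the same sub_c, which is exactly the situation that requires locking.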

Final Notes

This combination of data structure, algorithm, and implementation code provides a high-performance, non-SIMD matrix-matrix multiplication routine. Maintaining 90%+ performance using SIMD instructions may require optimizations that go beyond what is possible in straight C/C++/C#.