Compiler Optimizations

In Implementing a High Performance Matrix-Matrix Block Multiply (IaHPM), I listed the basic block multiplication code and said:

While this code is easy to understand, it relies on the compiler to treat the matrices as blocks. Many if not most compilers do this poorly.

To demonstrate how well compilers optimize matrix-matrix multiplication codes, including basic block multiplication, I benchmarked four versions of my matrix-matrix multiplication test bed. Under Linux (Fedora Core 18), I compiled the C++ version using GNU’s G++ 4.7 and Intel’s ICPC 13.1.1 from Composer XE 2013 for Linux. Under Windows 7 SP1, I compiled the Visual C++ and C# versions using Visual Studio 2012.

Both operating systems were fully updated when I ran the benchmarks, and all tests were run on the same bare hardware. ICPC and G++ used the -O3 flag. Visual C++ used the /O2 flag; architecture-specific optimizations were not specified. C# was compiled with optimization enabled. All of these compilers can use SIMD instructions to a greater (ICPC) or lesser (C#) extent. Given the test results, I suspect ICPC was using SIMD in a couple of test cases. I may revisit these tests with SIMD explicitly turned off in the future.

The tests

I benchmarked three matrix sizes: 1000 x 1000, 5000 x 5000, and 15000 x 15000. For the 1K and 5K tests, I used three element storage types: one-dimensional array (1D), array of arrays (AA), and array of arrays of 1024 element one-dimensional arrays (FM).

For 1D and AA, I ran three multiplication methods: Transpose, where the second matrix is transposed and a modified multiplication algorithm is used; Block, using the basic block multiply code presented in IaHPM; and Transpose Block, where the second matrix is transposed and a modified block multiply code is used. The FM multiplication used an algorithm similar to the modified block multiply algorithm presented at the end of IaHPM. For the 15K tests, I only ran the fastest method for each of the Windows compilers.

All algorithms used accumulators. Block routines that use MIN had the MIN calculated outside of the for-loop test. The FM routines had the inner loop unrolled. In the charts below, values slightly over 100% of the theoretical maximum may be due to inaccurate CPU speed values; values significantly over 100% may be due to SIMD instruction use.
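For readers who want something concrete, here is a minimal C++ sketch of the Transpose method with an accumulator. This is not the actual test bed code; the function name and the row-major 1D storage layout are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Transpose method sketch: B is transposed so both operands are walked
// row-by-row, then each element of C is an accumulated dot product.
// Matrices are n x n in row-major, one-dimensional (1D) storage.
std::vector<double> transpose_multiply(const std::vector<double>& a,
                                       const std::vector<double>& b,
                                       std::size_t n) {
    std::vector<double> bt(n * n), c(n * n);
    for (std::size_t i = 0; i < n; ++i)      // BT[j][i] = B[i][j]
        for (std::size_t j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;                // accumulator, per the text
            for (std::size_t k = 0; k < n; ++k)
                acc += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = acc;
        }
    return c;
}
```

Both inner array walks are now sequential in memory, which is the whole point of the transposition.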

1000 x 1000

[Chart: 1000 x 1000 benchmark results]

The 1000 x 1000 case is somewhat trivial: all of the elements necessary for a dot product will fit into the L1 cache. ICPC does very well at optimizing the transpose methods. Note that C# did not optimize the block methods very well. G++ and VC++ recognize the block code: the block routines are nearly the same speed as the basic transpose method.

5000 x 5000

[Chart: 5000 x 5000 benchmark results]

ICPC optimized the transpose block methods well. However, G++ and VC++ take the lead using the specialized FM data structure and routine.

15000 x 15000

[Chart: 15000 x 15000 benchmark results]

ICPC is still in third place, but not by much. C# makes a surprisingly strong showing. I may update this chart with the VC++ and C# values for the other multiplication methods in the future.


Is creating specialized data structures and calculation methods worth the effort? It depends on the situation. For matrix-matrix multiplication, probably not. ICPC did a great job on the transpose block algorithm. This allows the use of a one-dimensional array, which allows many standard libraries to be used.

Note: Please do not use your own matrix-matrix multiplication routines in a production system without having an amazingly good reason to do so. Use ATLAS, Intel’s MKL, cuBLAS, Math.Net, or some other BLAS implementation.

The critical question when it comes to optimizing compilers is: how easy is it to get an optimizing compiler to optimize your code? Compilers only optimize situations they recognize. If a compiler optimizes your code as written, that’s great. If it doesn’t, is it better to spend time working with the compiler or creating better data structures and code?

There isn’t an easy answer to this question. If your routine has to work across multiple compilers, it may be better to work on the implementation details. If you know you’ll be using an optimizing compiler on a regular basis, it may be better to learn the intricacies of that compiler. If processing speed is the competitive advantage for your research or business, it may be best to do both.

To get an optimizing compiler to optimize a piece of code, the code may need to be written in a specific way. Additionally, optimal performance may require testing different combinations of compiler flags. As the above results show, just putting an existing piece of code into an optimizing compiler using a “normal” set of compiler flags does not guarantee good results.

C++, Templates, and Optimizers

Using C# generics can cause performance issues. When I started porting my matrix-matrix multiplication test bed to C++, I was curious whether templates provided better performance. I had hoped C++’s compiled nature and more robust optimizer would make templates run nearly as fast as regular code.

But they don’t. Additionally, the performance hit is split differently than in C#. In C#, most of the performance loss is in going from the indexer method to the generic method. In C++, most of the performance loss is in going from the regular method to the indexer method.

My C++ matrices all inherit from the abstract MatrixBase class. Three indexer methods, GetValue, SetValue, and AddValue, allow for generic matrix-matrix multiplication routines. All three are implemented as inline methods in every class header file.
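To make the structure concrete, here is a hedged reconstruction of that layout. MatrixBase, MatrixAA, and the three indexer names come from the text; the member signatures, the Rows/Cols helpers, and MatrixAA's internals are my assumptions, not the actual test bed code.

```cpp
#include <cstddef>
#include <vector>

// Abstract base class exposing the three indexer methods.
class MatrixBase {
public:
    virtual ~MatrixBase() = default;
    virtual double GetValue(std::size_t r, std::size_t c) const = 0;
    virtual void SetValue(std::size_t r, std::size_t c, double v) = 0;
    virtual void AddValue(std::size_t r, std::size_t c, double v) = 0;
    virtual std::size_t Rows() const = 0;
    virtual std::size_t Cols() const = 0;
};

// Array-of-arrays (AA) storage; indexers are defined inline in the header.
class MatrixAA : public MatrixBase {
    std::vector<std::vector<double>> data_;
public:
    MatrixAA(std::size_t r, std::size_t c)
        : data_(r, std::vector<double>(c, 0.0)) {}
    double GetValue(std::size_t r, std::size_t c) const override { return data_[r][c]; }
    void SetValue(std::size_t r, std::size_t c, double v) override { data_[r][c] = v; }
    void AddValue(std::size_t r, std::size_t c, double v) override { data_[r][c] += v; }
    std::size_t Rows() const override { return data_.size(); }
    std::size_t Cols() const override { return data_.empty() ? 0 : data_[0].size(); }
};

// Template version: the compiler sees the concrete types, so in principle
// every indexer call could be inlined.
template <typename U, typename V>
U Multiply(const U& a, const V& b) {
    U c(a.Rows(), b.Cols());
    for (std::size_t i = 0; i < a.Rows(); ++i)
        for (std::size_t j = 0; j < b.Cols(); ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < a.Cols(); ++k)
                acc += a.GetValue(i, k) * b.GetValue(k, j);
            c.SetValue(i, j, acc);
        }
    return c;
}
```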

The execution time of multiplying two 1000 x 1000 double-precision matrices 20 times is disappointing. The code was compiled in Visual C++ under Visual Studio 2012 with /Ob2, /O2, /Oi, and /Ot. All listed methods use accumulators, store data in arrays of arrays, and run in parallel across four cores. The figures shown are elapsed time in seconds for the run.

Acc            5.361
Acc-Indexer   29.939
Acc-Template  40.211

Other multiplication and storage methods had similar differences. C# seemed to do well inlining the indexers, then had an optimizer issue implementing the generics.

VC++ seemed to have issues inlining the indexers, then did an OK job implementing the templates. Why? Is this a compiler-specific issue? I hope to test this in G++ and ICL, though this is low on my priority list.

Since templates are not critical to the core sections of my code, I will not be going through the generated assembly code. Templates work well in the non-critical sections. I recommend using templates to simplify testing and in the routines that organize benchmarking. But I can’t recommend them for performance-critical applications, at least not using Visual C++.

These performance differences surprise me and could be my fault. If you have a suggestion to improve template performance, please leave a comment.

Basic Cache Optimization – Performance Comparisons

Benchmarking Software

I have written a matrix-matrix multiplication test bed. The source code is available on GitHub. It explores far more than what was covered in this series, including the impact of accumulators, indexers, parallelism, underlying data structures, and so forth. The fastest routines do achieve the maximum theoretical, non-SIMD CPU performance. The code is in C# but easily converts to C++, and the concepts are applicable to most languages. If there is enough demand, a C++ version of the test bed will be released. The solution loads and compiles under Xamarin Studio and runs under Mono.

Test System

The test system is a 3.4 GHz Core i7-2600K “Sandy Bridge” CPU with 16 GB of RAM. Turbo Boost was disabled to allow a more accurate estimate of the potential performance of the system. Hyper-Threading was also disabled. Only the operations required for multiplication were measured. Creating and filling test matrices with random data was not included in the performance figures. For methods that required a matrix transposition, the transposition time was included. All matrix elements are double-precision (64-bit) floating point values.

Performance Figures

The performance figures compare a straightforward matrix-matrix multiplication routine to a transposition routine and a block multiplication routine. The latter two correspond to the first and third optimizations. All routines use accumulators.

I have not written a version of the second optimization for reasons beyond the scope of this article. The first two routines use an array of arrays to hold the matrix elements. The third uses an array of arrays of 1024 element one-dimensional arrays. This is the equivalent of a 32 x 32 block size.

Performance figures shown are a percentage of the test system’s theoretical maximum possible MFLOPS without SIMD. The theoretical maximum is approximately 3400 MFLOPS for single-core routines and 13600 MFLOPS for four-core. Both values were treated as exact for calculation purposes.

The MFLOPS of each operation was calculated using the average time from 100 runs for 1000 x 1000 matrices, 20 runs for 5000 x 5000, and 5 runs for 10000 x 10000 and 15000 x 15000. The number of floating point operations was calculated using the minimum possible for a square matrix of size N x N: 2 * N^3 – N^2. This slightly understates the actual number of operations performed, thus understating the actual MFLOPS.
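The MFLOPS and percent-of-peak arithmetic above can be sketched in a few lines. The 2·N³ − N² operation count and the 3400/13600 MFLOPS peaks come from the text; the function names are mine.

```cpp
// Minimum floating point operation count for an N x N multiply is
// 2*N^3 - N^2: N^3 multiplies plus N^3 - N^2 additions.
double mflops(double n, double avg_seconds) {
    double flop = 2.0 * n * n * n - n * n;
    return flop / avg_seconds / 1.0e6;
}

// Percentage of the test system's theoretical non-SIMD peak
// (approximately 3400 MFLOPS per core, treated as exact).
double percent_of_peak(double mf, int cores) {
    double peak = 3400.0 * cores;
    return 100.0 * mf / peak;
}
```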

1000 x 1000



Given the performance improvement, the complexity of the block algorithm compared to the transpose algorithm may not seem to justify the time investment. However, once larger matrices are multiplied, the benefits become apparent. 



For comparison purposes, here is Math.Net’s Intel MKL-based matrix-matrix multiplication vs the block algorithm. This chart shows MFLOPS, not percentage of maximum performance. Both algorithms use all four cores. The theoretical maximum of the test machine using SIMD on four cores is 109 GFLOPS.



Basic Cache Optimization – Part 4


The first three parts of this series walked through optimizing dot products and matrix-matrix multiplication for the cache. The general guidelines for cache optimization are partitioning, proximity, load speed, and load frequency.


Partitioning

The calculation must be divisible into subcalculations. Both the dot product and matrix-matrix multiplication calculations can be divided. Dot product can be done in groups and matrix-matrix multiplication can be done by multiplying submatrices.

The data must be divisible into blocks. The blocks must be sized such that as many as needed fit into cache simultaneously. Both examples were readily divisible. The dot products were divided into three element blocks. The matrices were divided into 3×3 submatrices. Two three element dot product blocks fit into cache at once, as did two 3×3 submatrices.


Proximity

Partitioning the data into blocks isn’t enough. When partitioned, the elements necessary for the divided calculation must all lie within the same block. Both examples had good proximity. The three element dot product blocks contained all the elements needed for that part of the dot product. The 3×3 submatrices for matrix-matrix multiplication contained all of the elements necessary for their multiplication.

Load Speed

Moving the data from the blocks into cache must be fast. This can be done in two ways. The data can be stored in RAM in structures designed so the memory manager automatically loads a block into cache. This requires using specialized data structures and may cause problems when moving data from one system to another.

The alternative is to copy a block’s elements from RAM into a special data structure the memory manager automatically loads into cache. Writing a high speed data copy routine is not trivial. Many languages do not have the features necessary to write one. Writing the routine in one language and calling it from another involves overhead. This overhead may slow the call down more than writing it in another language sped it up.

The dot product example had great load speed. The elements were in one-dimensional arrays. The memory manager automatically loaded the elements into cache in three element blocks.

The matrix-matrix multiplication calculation, as written, has horrible load speed. While the code may think of the data as being in submatrices, the elements in RAM are still in a giant, row-major, one-dimensional array. The data structure must be modified to store the elements in actual submatrices or special code will be needed to copy the submatrix elements from RAM into cache. Please see Implementing a High Performance Matrix-Matrix Block Multiply for one way of storing the matrix elements in submatrices.
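As a hedged sketch of the "special code" path, here is one way to copy a single submatrix out of a row-major array into a contiguous buffer that can be loaded into cache as a unit. The function name and parameters are illustrative assumptions, not code from IaHPM.

```cpp
#include <cstddef>
#include <vector>

// Copy the (block_row, block_col) submatrix of size bs x bs out of an
// n x n row-major matrix into a contiguous bs*bs buffer. Each submatrix
// row is one contiguous run in RAM, so each iteration of the outer loop
// copies one cache-friendly stretch.
std::vector<double> copy_block(const std::vector<double>& m, std::size_t n,
                               std::size_t block_row, std::size_t block_col,
                               std::size_t bs) {
    std::vector<double> block(bs * bs);
    for (std::size_t r = 0; r < bs; ++r) {
        std::size_t src = (block_row * bs + r) * n + block_col * bs;
        for (std::size_t c = 0; c < bs; ++c)
            block[r * bs + c] = m[src + c];
    }
    return block;
}
```

If this copy is not fast, the copying overhead eats the gains from block multiplication, which is exactly the concern raised above.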

Load Frequency

The final piece is ensuring the blocks as a group are loaded into cache as few times as possible. This requires looking at the big picture. Minimizing loads for a part of the calculation may prevent the calculation as a whole from being as fast as possible.

The dot product calculation automatically had the lowest possible load frequency. Each three element block was only needed once and was only loaded into cache once. Unfortunately, knowing how to optimize one part of a calculation may result in suboptimal performance for the calculation as a whole.

In the first matrix-matrix multiply, using the optimized dot product routine resulted in inefficient cache use. Each block of a row of A was needed multiple times and was loaded multiple times. The second optimization changed the order in which matrix-matrix multiplication was calculated, which resulted in each block of A being loaded only once. The third optimization also did this.


The only way to approach the maximum calculation potential of a CPU is by optimizing the calculation for the cache. The high clock speeds of modern CPUs have made RAM access prohibitively expensive. Even non-SIMD calculations written without cache optimization run more than 10x slower than a cache-optimized version.

Running an inefficient calculation on many CPUs is often considered a better solution than cache optimization. While this may raise throughput, latency remains an issue and increases the cost of mistakes. Unknowingly tainting a 10-minute calculation costs 10 minutes. Unknowingly tainting just one of ten 100-minute calculations costs 100 minutes.

SIMD optimization is predicated on cache optimization. The pinnacle of CPU performance is out of reach for those who do not optimize for the cache. The difference in performance between a SIMD-optimized calculation and one not optimized for the cache can be a factor of 100. This is the difference between taking over eight hours and taking five minutes.


Following the guidelines above may result in massive speed improvements for a code. If a programmer plans on using SIMD, cache optimization is the first step. However, cache optimization has some perils.

Before doing any cache optimization, a comprehensive test system must be available. If a tested, known-good reference routine is not available, start with the simplest implementation of the calculation. Once that is well-tested, use it as a validator. Cache optimization often involves data structures that create fencepost and/or symmetry problems. Tests must cross over partition boundaries, use non-integer multiples of block sizes, and so forth.

Partitioning schemes must take into account the nature of the data. Does the data naturally partition itself into the chosen block size? Will it continue to do so? Does partitioning happen invisibly to the end-programmer? Or do they have to code for it? All of these questions potentially limit the portability and lifespan of the code.

Cache-optimized calculations will be mathematically identical to the original calculation. But they may not have an identical number of operations. When working with floating point values, this may result in small differences between the two calculations. Additionally, care must be taken to avoid significant digit problems. If the code must match a reference calculation exactly, cache optimizations may be limited or impossible.
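A minimal illustration of why a different operation order produces a slightly different floating point result. The values here are chosen only to make the effect visible; they are not from the article.

```cpp
// Summing the same three values in two mathematically identical orders
// gives two different doubles: adding 1.0 to 1.0e16 one at a time loses
// each 1.0 to rounding, while combining the small terms first does not.
bool order_matters() {
    double big = 1.0e16, small = 1.0;
    double left  = (big + small) + small;   // small terms absorbed one at a time
    double right = big + (small + small);   // small terms combined first
    return left != right;
}
```

This is the same class of difference a cache-optimized routine can exhibit against a reference calculation.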

Intentionally picking the order of operations in a calculation allows the best optimization the programmer is capable of envisioning. Unfortunately, this increases development time and limits the optimization to what the programmer can conceptualize. Whether this level of optimization is better or worse than what the compiler would produce depends on the compiler and the programmer. Additionally, some compilers may produce better output than a programmer only when “tricked”. Learning how to “trick” the compiler takes time and these tricks may change between minor releases of the compiler. Whether it is better to optimize the code by hand or play compiler games depends on the programmer, the compiler, and the situation.

Different CPUs use different cache sizes. As of 2013, almost all x86-compatible laptop, desktop, and server CPUs use 32 KB L1 data caches. Many tablet and phone CPUs use 16 KB data caches. Some older ones use 8 KB data caches. Whether this is an issue depends on the purpose of the code.

If data partitioning is done via a custom data structure, custom support routines will be needed. Using generic support routines via indexers or mapping objects may leave the system with one amazingly fast calculation and many more extremely slow ones. Writing and testing custom support routines and converters is non-trivial.

This series deals with optimizing for the L1 data cache. When writing code destined for SIMD optimization, this may not be enough. Optimizing beyond the L1 cache is beyond the scope of this article. To get an idea of what is involved, please follow the links from Kazushige Goto’s Wikipedia entry.

Please continue on to the performance comparisons.

Basic Cache Optimization – Part 3

This is part three in a series on basic cache optimization. Please start with part 1.

The second optimization improved the speed of the overall calculation. But, like the first optimization, it required transposing the second matrix. Transposition requires time and memory. However, dividing the rows into blocks was a good start.

Third Optimization

In order to explain the third optimization, we need a cache more like one used in a real CPU. The cache now holds eighteen elements. The memory manager now pulls in nine elements at a time. The example matrices are the same. On paper:


and in RAM:


In the second optimization, each row was divided into two halves. Each half was used to calculate a partial dot product. As the calculation continued, the partial dot products were summed. Once all of the partial dot products were calculated and summed, C = A * B was complete.

Dividing the rows into halves divided, or partitioned, them into blocks. In this case, each block was one row high by three columns wide. While BT wasn’t explicitly partitioned by the code, the memory manager effectively partitioned BT by loading three elements of each row at a time.

A larger cache can hold more elements. The 18 element cache could be used to hold an entire row of A and BT. But that still requires transposing B to create BT. It would be better if B didn’t need to be transposed in order to use the cache effectively. The two matrices are once again A and B, both on paper:


and in RAM:


The second optimization divided each row into three element blocks. This was possible because a dot product can be calculated in parts. However, a row divided into blocks is also a matrix. A three element block is a 1 x 3 matrix.

This means A and B can be divided into sizes other than 1 x 3. The formula for multiplying a matrix that has been divided into blocks is the same as the normal matrix-matrix multiplication formula. The only difference is that the elements of A, B, and C are now matrices. Partitioning A and B into 3×3 submatrices:


turns them into a 3 x 2 matrix and a 2 x 2 matrix:


Since multiplying two matrices that are composed of submatrices follows the normal multiplication rules, A * B = C looks like:


Following the rules, C[0,0] = A[0,0] * B[0,0] + A[0,1] * B[1,0]. Matrix multiplication has already been defined, so calculating A[0,0] * B[0,0] and A[0,1] * B[1,0] is simple. Adding two matrices requires adding corresponding elements, which is also trivial.

One advantage of this is the submatrix size can be tailored to the cache size. Two 3 x 3 matrices will fit into an eighteen element cache. This makes multiplying two 3 x 3 matrices very fast, much like dividing rows into three element pieces made calculating dot products very fast. 

Another advantage is the order of operations can be tailored to use the cache efficiently. Just like the second optimization’s dot products, the submatrix multiplications can be partially calculated. Instead of loading A[0,0] and B[0,0] into cache and multiplying them, then loading A[0,1] and B[0,1] and multiplying them, the code can load A[0,0] into cache and keep it there. It multiplies A[0,0] by B[0,0] and stores a temporary result. Then it multiplies A[0,0] by B[0,1] and stores a temporary result. Only then does it move on to A[0,1]. This minimizes how many times each submatrix is loaded into cache.

The final advantage is this simplifies writing a fast matrix-matrix multiplication routine. Instead of trying to come up with code that is fast for any size matrix, time can be spent coming up with an insanely fast routine for multiplying 3 x 3 matrices. Or 10 x 10. Or whatever size fits the cache best. Once this insanely fast routine is complete, any pair of matrices can be partitioned into blocks of this optimal size.
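The three advantages above can be sketched in one routine. This is a simplified illustration, not the IaHPM implementation: it assumes row-major 1D storage and a matrix size that is a multiple of the block size, and uses the example's 3 x 3 blocks.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t BS = 3;   // block size tuned to the cache in the example

// Block multiplication with the order of operations described above:
// each submatrix of A (the bi/bk block) is reused against a full row of
// B's submatrices (the bj loop) before the code moves on, minimizing how
// many times each block of A is loaded into cache.
std::vector<double> block_multiply(const std::vector<double>& a,
                                   const std::vector<double>& b,
                                   std::size_t n) {
    std::vector<double> c(n * n, 0.0);
    for (std::size_t bi = 0; bi < n; bi += BS)
        for (std::size_t bk = 0; bk < n; bk += BS)       // A's block stays "hot"
            for (std::size_t bj = 0; bj < n; bj += BS)
                for (std::size_t i = bi; i < bi + BS; ++i)
                    for (std::size_t k = bk; k < bk + BS; ++k) {
                        double aik = a[i * n + k];
                        for (std::size_t j = bj; j < bj + BS; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
    return c;
}
```

The innermost three loops are the fixed-size kernel; in a real implementation that kernel is the piece worth making insanely fast.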

Unfortunately, getting the data in a block from RAM into cache is not trivial. If the data is stored in a standard row-major or column-major array, special code is needed to copy the data from the main matrix array in RAM to specialized submatrix arrays that can be loaded into cache. If this code is slow, block multiplication will be slower than transposing the second matrix. Fast array copy code often requires programming language features unavailable in many recent programming languages.

The other way of handling this is to store the data in submatrix-major order. This has its own drawbacks that are covered in the conclusion. For an implementation that works in most languages, please see Implementing a High Performance Matrix-Matrix Block Multiply.

Conclusion and concerns are in part four.

Basic Cache Optimization – Part 2

This is part two in a series on basic cache optimization. Please start with part 1.

The first optimization improved the dot product calculation. But the order in which the dot products were calculated resulted in each partial row of A being loaded into cache multiple times. The overall matrix multiplication calculation would run faster if each partial row of A was used as many times as possible before overwriting it in cache.

Second Optimization

A and B are the same as in part one. The cache, CA, still holds six elements. The memory manager retrieves three elements from RAM at a time. As in the first optimization, the calculation has been modified to use transposed B, BT. On paper, the matrices are:


In RAM, they are:


When multiplying two matrices, C = A * B, each element of C, C[i, j], is the dot product of the i-th row of A and the j-th row of BT.

There is no rule that a dot product must be calculated all at once. The formula could break each dot product into sections. C[i, j] is the dot product of the first three element pairs of the i-th row of A and the j-th row of BT plus the dot product of the second three element pairs of the i-th row of A and the j-th row of BT.
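A tiny sketch makes the point: a dot product computed in sections, summed at the end, equals the dot product computed all at once. The helper name and signature are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Partial dot product over [start, start + len): summing the partial
// results of consecutive sections reproduces the full dot product.
double partial_dot(const std::vector<double>& x, const std::vector<double>& y,
                   std::size_t start, std::size_t len) {
    double acc = 0.0;
    for (std::size_t i = start; i < start + len; ++i)
        acc += x[i] * y[i];
    return acc;
}
```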

In terms of reading data from RAM, the first optimization did this. Each read from RAM brought in three useful elements. Making the calculation explicitly work with three element pairs at a time lets the programmer work better with the cache. Working with three element pieces, or blocks, of each matrix row lets the programmer control, to an extent, what blocks are kept in cache and for how long.

If the code calculates the dot products of all the first three element pairs, then all of the second three element pairs, the number of RAM accesses is reduced. When C[0,0] is calculated, the first three elements of the first rows of A and BT are loaded into cache and their dot product is calculated. This is the same as in the first optimization:


Instead of continuing on to the next three elements of the first rows of A and BT, the code saves this intermediate dot product. The code then calculates the dot product of the first three elements of the first row of A and the first three elements of the second row of BT.


The cache is now:


This process is repeated using the first three elements of the remaining rows of BT. The code is now finished with the first three elements of the first row of A. They were loaded into cache once, used, and can be overwritten.

The code then calculates intermediate dot products for the second three elements of the first row of A and the second three elements of each row of BT. These intermediate dot products are added to the ones stored earlier. The result of this set of calculations is the entire first row of C.

This process is repeated for the remaining rows of A. Once these are complete, all of C has been filled in. This change in calculation order yields a large improvement in the number of RAM accesses.
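The second optimization's loop order can be sketched as follows. This is an illustration, not benchmark code: it assumes row-major 1D storage, a column count that is a multiple of the block width, and that B has already been transposed into BT.

```cpp
#include <cstddef>
#include <vector>

// Second optimization: for each row of A, take one three-element block at
// a time and use it against every row of BT before moving on, so each
// block of A's row is loaded into cache exactly once.
std::vector<double> second_opt_multiply(const std::vector<double>& a,
                                        const std::vector<double>& bt,
                                        std::size_t rowsA, std::size_t cols,
                                        std::size_t rowsBT) {
    const std::size_t blk = 3;   // elements per cache load in the example
    std::vector<double> c(rowsA * rowsBT, 0.0);
    for (std::size_t i = 0; i < rowsA; ++i)
        for (std::size_t k = 0; k < cols; k += blk)      // one block of A's row...
            for (std::size_t j = 0; j < rowsBT; ++j) {   // ...against every row of BT
                double acc = 0.0;
                for (std::size_t e = k; e < k + blk; ++e)
                    acc += a[i * cols + e] * bt[j * cols + e];
                c[i * rowsBT + j] += acc;                // intermediate dot product
            }
    return c;
}
```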

For the example multiplication, A has nine rows and six columns. B has six rows and six columns. C has nine rows and six columns. The cache, CA, holds six elements and the memory manager loads three elements into cache at a time. Each of the methods requires the following number of RAM accesses.

Method          RAM Accesses
Basic           8 * 54 = 432
First Opt.      4 * 54 = 216
Second Opt.     14 * 9 = 126

The first and second optimizations require transposing B. This takes time and uses memory. There is a better way of optimizing matrix multiplication for the cache.

To be continued in part three.

Basic Cache Optimization – Part 1

Decades of refinement have resulted in CPUs optimized for processing arrays of numerical data. A cache, and later multiple caches, were introduced to ameliorate the growing disparity between CPU and RAM speeds. To achieve the maximum possible CPU performance for a given calculation, the data necessary for the calculation must be loaded into the cache as quickly as possible and as few times as possible.

If all of the data required for a calculation fits into cache simultaneously, no cache-specific optimization is necessary. When the data required for a calculation is too large to fit into the cache, it must be processed in blocks. These blocks should contain all necessary data to complete a subset of the calculation. If possible, they should also contain the data necessary to complete the next subset of the calculation. Failing to do this efficiently will result in unnecessary RAM accesses and longer execution times.

Modern CPUs have multiple cores and multiple caches. Commonly, each core has its own cache (the L1 cache) and shares one or more other caches (L2, L3, perhaps others) with the remaining cores. This article uses vector dot product and matrix-matrix multiplication to demonstrate optimizing for the L1 cache. Depending on the data characteristics and potential processing speed of a particular calculation, optimizing for other caches may be useful.

Basic Example: Vector Dot Product

Some calculations may be inherently cache optimized. Vector dot product is one. Take two nine element vectors, X and Y, stored as one-dimensional arrays. The CPU has a six element cache, CA. To compute X • Y, corresponding elements of X and Y are multiplied together and the resulting products are summed. System memory initially looks like:


The dot product code starts with the first elements of X and Y, X[0] and Y[0]. When the code requests these from RAM, the memory manager fetches X[0] and the values stored in the next several sequential memory locations. These values are copied into cache. It then performs the same process for Y[0].

Once the data has been retrieved, the code stores 1 * 11 in a temporary variable. (Writing to cache will be covered in a separate article.) If the memory manager fetches three elements at a time, system memory now looks like:


The code moves on to the second pair of elements. When it requests X[1] and Y[1], the memory manager retrieves each from cache. This replaces two slow RAM accesses with two fast cache accesses. The third element pair, X[2] and Y[2], is also retrieved from cache.

When the code requests the fourth element pair, the memory manager has to read them from RAM. It performs one read for X’s elements 4, 5, and 6 and another read for Y’s elements 4, 5, and 6. These six values are copied into the cache. System memory now looks like:


Element pairs 4, 5, and 6 can be multiplied without accessing RAM. Element pairs 7, 8, and 9 require two more RAM accesses. This entire dot product calculation requires six reads from RAM: two reads each for the first, fourth, and seventh element pairs. These six reads load the remaining element pairs as a side effect.

This calculation is inherently cache-optimized. The vectors are stored as one-dimensional arrays. The array elements are sequential in RAM. The array elements are accessed in sequence. The memory manager’s default actions are optimal for this calculation.

A programmer could short-circuit this inherent optimization. If X • Y were computed as X[0] * Y[0] + X[3] * Y[3] + X[6] * Y[6] and so forth, the calculation would take much longer than necessary. Every element pair would require two reads from RAM. The calculation would go from six RAM reads to eighteen.
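The sequential dot product described above is a few lines of code. The vector values used in the test below are assumed from the example's 1 * 11 hint; the function name is mine.

```cpp
#include <cstddef>
#include <vector>

// Sequential dot product: elements of X and Y are read in the order they
// sit in RAM, so each cache-line fill serves several iterations. Skipping
// around the arrays instead would defeat this for no benefit.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double acc = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        acc += x[i] * y[i];
    return acc;
}
```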

Unfortunately, many data structures do not work well with how the cache is filled.

Complex Example: Matrix-Matrix Multiplication

While programmers and researchers often use n-dimensional data structures, RAM is a one-dimensional array. Mapping a two-, three-, or more-dimensional array into one-dimensional RAM often leads to inefficient cache use. Matrix-matrix multiplication is a perfect example. If needed, please see A Quick Introduction to Matrix-Matrix Multiplication for a brief refresher.

Basic Multiplication

Matrices are generally thought of as two-dimensional arrays. But what appears on paper as:


looks like this in RAM:


The first element of A * B is the dot product of A’s first row and B’s first column. On paper, this is simple.


In RAM, it is not.


When the code begins calculating the dot product, it requests A[0,0] and B[0,0]. Using the same six element cache and memory manager that loads three sequential elements at a time, the cache looks like this:


The first three elements of A’s first row are now in cache. The first three elements of B’s first row are also in cache. Unfortunately, we need the first three elements of B’s first column. Only the first element of B’s first row is useful.

Since one row of A and one column of B cannot fit into cache at the same time, the best scenario for a single dot product calculation is four reads from RAM: two reads per row of A and two reads per column of B. But, as shown, the data structure doesn’t allow this.

With this data structure, each dot product requires eight RAM accesses: one for every three elements of a row of A plus one for every element of a column of B. A * B requires 54 dot products, so the multiplication calculation went from a best case of 216 RAM accesses to 432. A modern CPU core can do hundreds if not thousands of operations in the time one RAM access takes.
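The basic multiplication being described is the classic triple loop. This sketch assumes row-major 1D storage; note how the access to B strides through memory, which is what causes the extra RAM accesses.

```cpp
#include <cstddef>
#include <vector>

// Basic multiplication: rows of A are sequential in RAM, but each column
// of B is walked with stride n, so nearly every B access misses the cache.
std::vector<double> basic_multiply(const std::vector<double>& a,
                                   const std::vector<double>& b,
                                   std::size_t n) {
    std::vector<double> c(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j];   // b strides by n
            c[i * n + j] = acc;
        }
    return c;
}
```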

First Optimization

If B’s columns were in rows, they would be in sequential memory. This would result in better cache use because the memory manager would automatically load three useful elements of B into cache per RAM access. Transposing a matrix swaps the rows and columns. BT represents transposed B. On paper, the new matrices look like:


They look like this in RAM:


This change turns the cache for the first portion of the first dot product into:


This reduces the RAM accesses per dot product to four. Since this is the calculated minimum number of accesses per dot product, the data now seems optimized for the cache.

But it isn’t. Cache usage for calculating a single dot product is optimized. But the calculation multiplies two matrices. The second element of A * B is the dot product of the first row of A and the second column of B. Using BT, the calculation is the dot product of the first row of A and the second row of BT.


This will take an additional four RAM accesses. Two of these will be spent loading the first row of A into the cache for the second time. Optimizing the low-level calculation, in this case dot product, improved performance. However, optimizing the calculation as a whole will cut the number of RAM accesses almost in half for this example.
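The transpose approach can be sketched as follows, using the same row-major one-dimensional layout. The function names are mine; this illustrates the idea rather than reproducing the benchmarked code.

```cpp
#include <cstddef>
#include <vector>

// Multiply A (m x n) by B (n x p) after transposing B, so both operands are
// traversed row by row through sequential memory.
std::vector<double> multiply_transpose(const std::vector<double>& A,
                                       const std::vector<double>& B,
                                       std::size_t m, std::size_t n, std::size_t p) {
    // BT[j * n + k] == B[k * p + j]: column j of B becomes row j of BT.
    std::vector<double> BT(p * n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < p; ++j)
            BT[j * n + k] = B[k * p + j];

    std::vector<double> C(m * p, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < p; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * BT[j * n + k];  // both strides are now 1
            C[i * p + j] = acc;
        }
    }
    return C;
}
```

The transpose itself costs extra RAM traffic, but it is paid once and amortized over every dot product that reuses a row of BT.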

To be continued in part two.

C#, Generics, and Optimizers

Generics are a handy feature of many languages. C# has had them since version 2.0. In some circumstances, their performance impact is minimal. In others, using them can cause significant slowdowns. Here is an example of the latter.

My matrix interface requires an indexer. Indexers can adversely affect performance, so my matrix classes have indexer versions of each of the multiplication methods for benchmarking purposes. This was done even though the code is shared, as I suspected there would be a performance price for using generics.

Having determined the performance cost of indexers, I decided to convert the methods to generics. I thought the additional cost of generics would be fairly low and it would be handy to easily multiply different matrix classes. The methods went from this format:

public MatrixAA Multiply(MatrixAA multiplier) { .... }

to this:

public static U Multiply<U, V>(U matrixA, V matrixB)
    where U:IMatrix<U>, new()
    where V:IMatrix<V> { .... }

The interface itself uses generics because the required methods must return their own type. The new() constraint is required on U so the product matrix can be created.

The performance hit was incredible. The above generic is 82% slower than the non-generic. The parallel version is 65% slower. More sophisticated multiplication methods show similar degradation.

Ildasm’s disassemblies of the generic and non-generic versions of the method imply the issue is either the use of an interface or the use of a generic interface. The non-generic version compiles to late-bound calls to MatrixAA.get_Item(). The generic version compiles to late-bound calls to IMatrix<U>.get_Item() and IMatrix<V>.get_Item().

The speed difference could be the overhead of getting from IMatrix<U>.get_Item() to MatrixAA.get_Item() at run-time. Or the JIT optimizer has trouble optimizing the generic version of the call. Either way, the performance loss is too great for me to use the generic versions. The indexer methods need to show the cost of using an indexer, not the cost of using generics in this situation.

Implementing a High Performance Matrix-Matrix Block Multiply

A basic block matrix-matrix multiply code divides the operands into logical sub-matrices and multiplies and adds the sub-matrices to calculate the product. For A * B = C and a block size of BS:

for (int blockRow = 0; blockRow < A.Rows; blockRow += BS) {
    for (int blockColumn = 0; blockColumn < B.Columns; blockColumn += BS) {
        for (int blockK = 0; blockK < A.Columns; blockK += BS) {
            for (int row = blockRow; row < MIN(blockRow + BS, A.Rows); ++row) {
                for (int column = blockColumn; column < MIN(blockColumn + BS, B.Columns); ++column) {
                    for (int k = blockK; k < MIN(blockK + BS, A.Columns); ++k) {
                        C[row, column] += A[row, k] * B[k, column];
                    }
                }
            }
        }
    }
}

While this code is easy to understand, it relies on the compiler to treat the matrices as blocks. Many if not most compilers do this poorly. Even with significant optimization, this code cannot overcome the limitations of most commonly used matrix data structures.

In order to implement a high performance matrix-matrix block multiply, the data structure and implementation code must be carefully optimized. In this article, the block size is BS, the matrices are A * B = C, and the matrix elements are double-precision (64-bit) floating point numbers.

The Data Structure

Instead of relying on the compiler to optimize the basic block multiply code, the matrix data will be stored directly in blocks. While a matrix M has X logical rows and Y logical columns of double-precision elements, the actual structure will be an array of arrays: M has CEILING(X / BS) rows and CEILING(Y / BS) columns of blocks, and each physical element of M is a one-dimensional array of doubles with BS * BS elements. While using an array of arrays of an existing matrix class may be tempting, overall performance is unlikely to be adequate.
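A minimal sketch of such a structure, assuming a block size of 32; the class and member names are my invention, not the article's actual matrix classes.

```cpp
#include <cstddef>
#include <vector>

// Matrix stored as a grid of BS x BS blocks. Each block is one contiguous
// BS * BS array of doubles; blocks[r][c] holds logical rows
// [r*BS, r*BS + BS) and logical columns [c*BS, c*BS + BS).
struct BlockMatrix {
    static const std::size_t BS = 32;    // block size (tune per CPU)
    std::size_t rows, cols;              // logical dimensions
    std::size_t blockRows, blockCols;    // physical grid dimensions
    std::vector<std::vector<std::vector<double>>> blocks;

    BlockMatrix(std::size_t r, std::size_t c)
        : rows(r), cols(c),
          blockRows((r + BS - 1) / BS),  // CEILING(r / BS)
          blockCols((c + BS - 1) / BS),  // CEILING(c / BS)
          blocks(blockRows,
                 std::vector<std::vector<double>>(
                     blockCols, std::vector<double>(BS * BS, 0.0))) {}

    // Map a logical (i, j) element to its block and offset within the block.
    double& at(std::size_t i, std::size_t j) {
        return blocks[i / BS][j / BS][(i % BS) * BS + (j % BS)];
    }
};
```

Partial edge blocks are allocated at full BS * BS size and zero-padded, which is what produces the small amount of wasted memory discussed below.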

This design will result in a small amount of wasted memory. The worst case, where both the number of rows and the number of columns are one element more than an integer multiple of BS, will waste X * (BS - 1) * sizeof(double) + Y * (BS - 1) * sizeof(double) bytes. For a 100,000 x 100,000 matrix, this adds ~47 MB to the matrix’s ~75 GB size.
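The worst-case arithmetic can be checked directly. Assuming a block size of 32 (my assumption; it reproduces the ~47 MB figure), the waste for the 100,000 x 100,000 case works out to 49,600,000 bytes.

```cpp
#include <cstddef>

// Worst-case wasted bytes for an X x Y matrix stored in BS x BS blocks:
// (BS - 1) padding elements per row plus (BS - 1) per column, as in the
// formula above. BS = 32 is an assumed value.
constexpr std::size_t wasted_bytes(std::size_t X, std::size_t Y, std::size_t BS) {
    return X * (BS - 1) * sizeof(double) + Y * (BS - 1) * sizeof(double);
}
// wasted_bytes(100000, 100000, 32) = 2 * 100000 * 31 * 8 = 49,600,000 bytes,
// roughly 47 MiB against ~75 GiB of element data (100,000^2 * 8 bytes).
```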

The Multiplication Algorithm

Since a matrix of matrices is still a matrix, the standard matrix-matrix multiplication algorithm holds. Each element C[i, j] is the dot product of the i-th row of A and the j-th column of B. The dot product of two vectors of matrices is the sum of the product of the corresponding vector elements.

The Implementation Code

// C99/C++-style code. If using C#, substitute double[] for double*

for (int blockRow = 0; blockRow < A.blockRowCount; ++blockRow) {
    for (int blockColumn = 0; blockColumn < B.blockColumnCount; ++blockColumn) {
        double* subC = C[blockRow, blockColumn];
        for (int blockK = 0; blockK < A.blockColumnCount; ++blockK) {
            double* subA = A[blockRow, blockK];
            double* subB = B[blockK, blockColumn];
            for (int row = 0; row < BS; ++row) {
                for (int column = 0; column < BS; ++column) {
                    for (int k = 0; k < BS; ++k) {
                        subC[row * BS + column] += subA[row * BS + k] * subB[k * BS + column];
                    }
                }
            }
        }
    }
}

Average single-threaded performance should be near 98% of theoretical, non-SIMD maximum for matrices over 2000 x 2000. Average multi-threaded performance should be over 95% for matrices over 3000 x 3000. Performance for square matrices should increase as matrix size increases.

If these performance levels are not reached via compiler options, some hand-optimization may be necessary. Once the optimal block size is known, unrolling the k loop will provide a significant speed improvement. Changing the order of the blockRow, blockColumn, and blockK loops may improve performance by a few percentage points.
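As an illustration, the k loop unrolled by a factor of four might look like this. This is a sketch, not the tuned code; it assumes BS is a multiple of 4 and uses an accumulator as described in the tests.

```cpp
// Inner kernel of the block multiply with the k loop unrolled by 4.
// subA, subB, subC are contiguous BS * BS blocks; assumes BS % 4 == 0.
void block_kernel_unrolled(const double* subA, const double* subB,
                           double* subC, int BS) {
    for (int row = 0; row < BS; ++row) {
        for (int column = 0; column < BS; ++column) {
            double acc = 0.0;  // accumulator, written back once per element
            for (int k = 0; k < BS; k += 4) {
                acc += subA[row * BS + k]     * subB[k       * BS + column]
                     + subA[row * BS + k + 1] * subB[(k + 1) * BS + column]
                     + subA[row * BS + k + 2] * subB[(k + 2) * BS + column]
                     + subA[row * BS + k + 3] * subB[(k + 3) * BS + column];
            }
            subC[row * BS + column] += acc;
        }
    }
}
```

Unrolling by the full, known block size (rather than 4) removes the inner loop test entirely, which is where the significant speed improvement comes from.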

Determining Block Size

Experimentation will determine the optimal block size. SQRT(L1 cache size / (4 * sizeof(double))) is a good place to start.
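For example, with a 32 KB L1 data cache (a common size), the formula suggests a starting block size of 32:

```cpp
#include <cmath>
#include <cstddef>

// Starting block size estimate from the formula above: roughly four BS x BS
// double blocks (subA, subB, subC, plus slack) should fit in L1.
int starting_block_size(std::size_t l1_bytes) {
    return static_cast<int>(std::sqrt(l1_bytes / (4.0 * sizeof(double))));
}
// 32 KB L1: sqrt(32768 / 32) = sqrt(1024) = 32
```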

Parallelization Note

When parallelizing the code, special attention must be paid to the loop design, which must guarantee that each result sub-matrix subC is updated by a single thread. If the design itself cannot guarantee this, explicit locking must be used. Otherwise, random errors may result, and these errors may only involve a tiny number of elements out of millions or billions. The test system must compare every element of the result against the output of a known-good matrix-matrix multiplication routine.
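One design that satisfies the single-writer requirement by construction is to partition the blockRow loop across threads, so each thread owns entire rows of result blocks. A sketch using std::thread (the helper name and signature are mine):

```cpp
#include <functional>
#include <thread>
#include <vector>

// Partition blockRows [0, rowCount) across threadCount threads. Each blockRow
// is processed by exactly one thread, so every result sub-matrix in that row
// of C has a single writer and no locking is needed.
void parallel_rows(int rowCount, int threadCount,
                   const std::function<void(int)>& processBlockRow) {
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([=, &processBlockRow] {
            // Thread t owns blockRows t, t + threadCount, t + 2*threadCount, ...
            for (int blockRow = t; blockRow < rowCount; blockRow += threadCount)
                processBlockRow(blockRow);
        });
    }
    for (auto& th : threads) th.join();
}
```

The callback would run the blockColumn/blockK loops from the implementation code above for the given blockRow; because ownership is disjoint, no two threads ever touch the same subC.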

Final Notes

This combination of data structure, algorithm, and implementation code provides a high-performance, non-SIMD matrix-matrix multiplication routine. Maintaining 90%+ performance using SIMD instructions may require optimizations that go beyond what is possible in straight C/C++/C#.