Compiler Optimizations

In Implementing a High Performance Matrix-Matrix Block Multiply (IaHPM), I listed the basic block multiplication code and said:

While this code is easy to understand, it relies on the compiler to treat the matrices as blocks. Many if not most compilers do this poorly.

To demonstrate how well compilers optimize matrix-matrix multiplication codes, including basic block multiplication, I benchmarked four versions of my matrix-matrix multiplication test bed. Under Linux (Fedora Core 18), I compiled the C++ version using GNU’s G++ 4.7 and Intel’s ICPC 13.1.1 from Composer XE 2013 for Linux. Under Windows 7 SP1, I compiled the Visual C++ and C# versions using Visual Studio 2012.

Both operating systems were fully updated when I ran the benchmarks and were run on the same bare hardware. ICPC and G++ used the -O3 flag. Visual C++ used the /O2 flag and architecture-specific optimizations were not specified. C# had optimization on. All of these compilers can use SIMD instructions to a greater (ICPC) or lesser (C#) extent. Given the test results, I suspect ICPC was using SIMD in a couple test cases. I may revisit these tests with SIMD explicitly turned off in the future.

The tests

I benchmarked three matrix sizes: 1000 x 1000, 5000 x 5000, and 15000 x 15000. For the 1K and 5K tests, I used three element storage types: one-dimensional array (1D), array of arrays (AA), and array of arrays of 1024 element one-dimensional arrays (FM).

For 1D and AA, I ran three multiplication methods. Transpose, where the second matrix is transposed and a modified multiplication algorithm is used. Block, using the basic block multiply code presented in IaHPM. Transpose Block, where the second matrix is transposed and a modified block multiply code is used. The FM multiplication used an algorithm similar to the modified block multiply algorithm presented at the end of IaHPM. For the 15K tests, I only ran the fastest method for each of the Windows compilers.

All algorithms used accumulators. Block routines that use MIN had the MIN calculated outside of the for loop test. The FM routines had the inner loop unrolled. Values slightly over 100% may be due to inaccurate CPU speed values. Values significantly over 100% may be due to SIMD instruction use.

1000 x 1000

Screen Shot 2013-05-15 at 5.21.55 PM

The 1000 x 1000 case is somewhat trivial: all of the elements necessary for a dot product will fit into the L1 cache. ICPC does very well at optimizing the transpose methods. Note that C# did not optimize the block methods very well. G++ and VC++ recognize the block code: the block routines are nearly the same speed as the basic transpose method.

5000 x 5000

Screen Shot 2013-05-15 at 5.22.40 PM

ICPC optimized the transpose block methods well. However, G++ and VC++ take the lead using the specialized FM data structure and routine.

15000 x 15000

Screen Shot 2013-05-15 at 5.22.53 PM

ICPC is still in third place, but not by much. C# makes a surprisingly strong showing. I may update this chart with the VC++ and C# values for the other multiplication methods in the future.


Is creating specialized data structures and calculation methods worth the effort? It depends on the situation. For matrix-matrix multiplication, probably not. ICPC did a great job on the transpose block algorithm. This allows the use of a one-dimensional array, which allows many standard libraries to be used.

Note: Please do not use your own matrix-matrix multiplication routines in a production system without having an amazingly good reason to do so. Use ATLAS, Intel’s MKL, cuBLAS, Math.Net, or some other BLAS implementation.

The critical question when it comes to optimizing compilers is: how easy is it to get an optimizing compiler to optimize your code. Compilers only optimize situations they recognize. If it optimizes your code as written, that’s great. If it doesn’t, is it better to spend time working with the compiler or creating better data structures and code?

There isn’t an easy answer to this question. If your routine has to work across multiple compilers, it may be better to work on the implementation details. If you know you’ll be using an optimizing compiler on a regular basis, it may be better to learn the intricacies of that compiler. If processing speed is the competitive advantage for your research or business, it may be best to do both.

To get an optimizing compiler to optimize a piece of code, the code may need to be written in a specific way. Additionally, optimal performance may require testing different combinations of compiler flags. As the above results show, just putting an existing piece of code into an optimizing compiler using a “normal” set of compiler flags does not guarantee good results.

Basic Cache Optimization – Performance Comparisons

Benchmarking Software

I have written a matrix-matrix multiplication testbed. The source code is available on GitHub. It explores far more than what was covered in this series, including the impact of accumulators, indexers, parallelism, underlying data structures, and so forth. The system does manage maximum theoretical, non-SIMD CPU performance. The code is in C# but easily converts to C++ and the concepts are applicable to most languages. If there is enough demand, a C++ version of the test bed will be released. The solution loads and compiles under Xamarin Studio and runs under Mono.

Test System

The test system is a 3.4 GHz Core i7-2600K “Sandy Bridge” CPU with 16 GB of RAM. Turbo Boost was disabled to allow a more accurate estimate of the potential performance of the system. Hyper-Threading was also disabled. Only the operations required for multiplication were measured. Creating and filling test matrices with random data was not included in the performance figures. For methods that required a matrix transposition, the transposition time was included. All matrix elements are double-precision (64-bit) floating point values.

Performance Figures

The performance figures compare a straightforward matrix-matrix multiplication routine to a transposition routine and a block multiplication routine. The latter two correspond to the first and third optimizations. All routines use accumulators.

I have not written a version of the second optimization for reasons beyond the scope of this article. The first two routines use an array of arrays to hold the matrix elements. The third uses an array of arrays of 1024 element one-dimensional arrays. This is the equivalent of a 32 x 32 block size.

Performance figures shown are a percentage of the test system’s theoretical maximum possible MFLOPS without SIMD. The theoretical maximum is approximately 3400 MFLOPS for single-core routines and 13600 MFLOPS for four-core. Both values were treated as exact for calculation purposes.

The MFLOPS of each operation was calculated using the average time from 100 runs for 1000 x 1000 matrices, 20 runs for 5000 x 5000, and 5 runs for 10000 x 10000 and 15000 x 15000. The number of floating point operations was calculated using the minimum possible for a square matrix of size N x N: 2 * N^3 – N^2. This slightly understates the actual number of operations performed, thus understating the actual MFLOPS.

1000 x 1000



Given the performance improvement, the complexity of the block algorithm compared to the transpose algorithm may not seem to justify the time investment. However, once larger matrices are multiplied, the benefits become apparent. 



For comparison purposes, here is Math.Net‘s Intel MKL-based matrix-matrix multiplication vs the block algorithm. This chart shows MFLOPS, not percentage of maximum performance. Both algorithms use all four cores. The theoretical maximum of the test machine using SIMD on four cores is 109 GFLOPS.