Compiler Optimizations

In Implementing a High Performance Matrix-Matrix Block Multiply (IaHPM), I listed the basic block multiplication code and said:

"While this code is easy to understand, it relies on the compiler to treat the matrices as blocks. Many if not most compilers do this poorly."

To demonstrate how well compilers optimize matrix-matrix multiplication codes, including basic block multiplication, I benchmarked four versions of my matrix-matrix multiplication test bed. Under Linux (Fedora 18), I compiled the C++ version using GNU's G++ 4.7 and Intel's ICPC 13.1.1 from Composer XE 2013 for Linux. Under Windows 7 SP1, I compiled the Visual C++ and C# versions using Visual Studio 2012.

Both operating systems were fully updated when I ran the benchmarks, and both ran directly on the same hardware. ICPC and G++ used the -O3 flag. Visual C++ used the /O2 flag, with no architecture-specific optimizations specified. C# was compiled with optimization enabled. All of these compilers can use SIMD instructions to a greater (ICPC) or lesser (C#) extent, and given the test results, I suspect ICPC used SIMD in a couple of test cases. I may revisit these tests with SIMD explicitly turned off in the future.

The Tests

I benchmarked three matrix sizes: 1000 x 1000, 5000 x 5000, and 15000 x 15000. For the 1K and 5K tests, I used three element storage types: a one-dimensional array (1D), an array of arrays (AA), and an array of arrays of 1024-element one-dimensional arrays (FM).
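
For concreteness, here is a rough C++ sketch of the three layouts. The variable names and the exact FM chunking are mine for illustration, not the test bed's actual declarations.

#include <array>
#include <cstddef>
#include <vector>

const std::size_t n = 5000;        // matrix dimension (illustrative)
const std::size_t kChunk = 1024;   // FM chunk length

// 1D: one contiguous buffer, indexed as row * n + col.
std::vector<double> d1(n * n);
// element (r, c): d1[r * n + c]

// AA: an outer array of independently allocated rows.
std::vector<std::vector<double>> aa(n, std::vector<double>(n));
// element (r, c): aa[r][c]

// FM: each row split into fixed 1024-element chunks, giving the inner
// loops a compile-time trip count that is easy to unroll.
std::vector<std::vector<std::array<double, kChunk>>> fm(
    n, std::vector<std::array<double, kChunk>>((n + kChunk - 1) / kChunk));
// element (r, c): fm[r][c / kChunk][c % kChunk]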

For 1D and AA, I ran three multiplication methods: Transpose, where the second matrix is transposed and a modified multiplication algorithm is used; Block, using the basic block multiply code presented in IaHPM; and Transpose Block, where the second matrix is transposed and a modified block multiply code is used. The FM multiplication used an algorithm similar to the modified block multiply algorithm presented at the end of IaHPM. For the 15K tests, I ran only the fastest method for each of the Windows compilers.
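
As a reference point, the Transpose method on the 1D layout looks roughly like this. This is my reconstruction of the idea, not the test bed's exact code:

#include <cstddef>
#include <vector>

// C = A * B with B transposed up front, so both inner-loop reads walk
// memory sequentially.
void MultiplyTranspose1D(const std::vector<double>& a,
                         const std::vector<double>& b,
                         std::vector<double>& c, std::size_t n) {
    std::vector<double> bt(n * n);              // B transposed
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];

    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;                   // accumulator
            for (std::size_t k = 0; k < n; ++k)
                acc += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = acc;
        }
}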

All algorithms used accumulators. Block routines that use MIN had the MIN calculated outside the for-loop condition rather than recomputed on every iteration. The FM routines had the inner loop unrolled. In the charts below, values slightly over 100% may be due to inaccurate CPU speed values; values significantly over 100% may be due to SIMD instruction use.
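
To show what the MIN hoisting means in practice, here is a sketch of a basic block multiply on the 1D layout, with each MIN computed once per block instead of inside the loop conditions. The block size is illustrative; see IaHPM for the actual code.

#include <algorithm>
#include <cstddef>
#include <vector>

const std::size_t kBlock = 64;   // illustrative block size

// C += A * B by blocks; assumes c starts zeroed.
void MultiplyBlock1D(const std::vector<double>& a,
                     const std::vector<double>& b,
                     std::vector<double>& c, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += kBlock) {
        const std::size_t iMax = std::min(ii + kBlock, n);   // MIN hoisted
        for (std::size_t kk = 0; kk < n; kk += kBlock) {
            const std::size_t kMax = std::min(kk + kBlock, n);
            for (std::size_t jj = 0; jj < n; jj += kBlock) {
                const std::size_t jMax = std::min(jj + kBlock, n);
                for (std::size_t i = ii; i < iMax; ++i)
                    for (std::size_t j = jj; j < jMax; ++j) {
                        double acc = 0.0;                    // accumulator
                        for (std::size_t k = kk; k < kMax; ++k)
                            acc += a[i * n + k] * b[k * n + j];
                        c[i * n + j] += acc;
                    }
            }
        }
    }
}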

1000 x 1000

[Chart: 1000 x 1000 benchmark results]

The 1000 x 1000 case is somewhat trivial: all of the elements necessary for a dot product fit into the L1 cache (two 1000-element double vectors total about 16 KB, comfortably under a typical 32 KB L1 data cache). ICPC does very well at optimizing the transpose methods. Note that C# did not optimize the block methods very well. G++ and VC++ recognize the block code: their block routines are nearly the same speed as the basic transpose method.

5000 x 5000

[Chart: 5000 x 5000 benchmark results]

ICPC optimized the transpose block methods well. However, G++ and VC++ took the lead using the specialized FM data structure and routine.

15000 x 15000

[Chart: 15000 x 15000 benchmark results]

ICPC is still in third place, but not by much. C# makes a surprisingly strong showing. I may update this chart with the VC++ and C# values for the other multiplication methods in the future.

Thoughts

Is creating specialized data structures and calculation methods worth the effort? It depends on the situation. For matrix-matrix multiplication, probably not: ICPC did a great job on the transpose block algorithm, which keeps the data in a one-dimensional array and therefore compatible with many standard libraries.

Note: Please do not use your own matrix-matrix multiplication routines in a production system without having an amazingly good reason to do so. Use ATLAS, Intel's MKL, cuBLAS, Math.NET, or some other BLAS implementation.

The critical question when it comes to optimizing compilers is: how easy is it to get an optimizing compiler to optimize your code? Compilers only optimize patterns they recognize. If the compiler optimizes your code as written, that's great. If it doesn't, is your time better spent working with the compiler or creating better data structures and code?

There isn’t an easy answer to this question. If your routine has to work across multiple compilers, it may be better to work on the implementation details. If you know you’ll be using an optimizing compiler on a regular basis, it may be better to learn the intricacies of that compiler. If processing speed is the competitive advantage for your research or business, it may be best to do both.

To get an optimizing compiler to optimize a piece of code, the code may need to be written in a specific way. Additionally, optimal performance may require testing different combinations of compiler flags. As the above results show, just putting an existing piece of code into an optimizing compiler using a “normal” set of compiler flags does not guarantee good results.

C++, Templates, and Optimizers

Using C# generics can cause performance issues. When I started porting my matrix-matrix multiplication test bed to C++, I was curious if templates provided better performance. I had hoped C++’s compiled nature and more robust optimizer would make templates run nearly as fast as regular code.

But they don’t. Additionally, the performance hit is split differently than in C#. In C#, most of the performance loss is in going from the indexer method to the generic method. In C++, most of the performance loss is in going from the regular method to the indexer method.

My C++ matrices all inherit from the abstract base class MatrixBase. Three indexer methods, GetValue, SetValue, and AddValue, allow for generic matrix-matrix multiplication routines. All three are implemented as inline methods in each class's header file.
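
The hierarchy looks roughly like this. MatrixBase and the indexer names match my code; the rest of the sketch is simplified for illustration:

#include <cstddef>
#include <vector>

// Abstract base class: the virtual indexers let one multiplication
// routine work against any storage layout.
class MatrixBase {
public:
    virtual ~MatrixBase() {}
    virtual double GetValue(std::size_t row, std::size_t col) const = 0;
    virtual void SetValue(std::size_t row, std::size_t col, double v) = 0;
    virtual void AddValue(std::size_t row, std::size_t col, double v) = 0;
};

// One concrete layout; the overrides live in the header so the
// compiler has a chance to inline them.
class Matrix1D : public MatrixBase {
public:
    explicit Matrix1D(std::size_t n) : n_(n), data_(n * n, 0.0) {}
    double GetValue(std::size_t r, std::size_t c) const {
        return data_[r * n_ + c];
    }
    void SetValue(std::size_t r, std::size_t c, double v) {
        data_[r * n_ + c] = v;
    }
    void AddValue(std::size_t r, std::size_t c, double v) {
        data_[r * n_ + c] += v;
    }
private:
    std::size_t n_;
    std::vector<double> data_;
};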

The execution time of multiplying two 1000 x 1000 double-precision matrices 20 times is disappointing. The code was compiled in Visual C++ under Visual Studio 2012 with /Ob2, /O2, /Oi, and /Ot. All listed methods use accumulators, store data in arrays of arrays, and run in parallel across four cores. The figures shown are elapsed time in seconds for the run.

Method         Time (s)
Acc             5.361
Acc-Indexer    29.939
Acc-Template   40.211
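
Building on the MatrixBase sketch above, the inner loops of the three variants differ roughly like this. The helper names are hypothetical, only a single dot product is shown, and the second matrix is assumed already transposed:

#include <cstddef>

// Acc: direct access to the concrete storage (arrays of arrays here).
double DotAcc(double** a, double** bt, std::size_t i, std::size_t j,
              std::size_t n) {
    double acc = 0.0;
    for (std::size_t k = 0; k < n; ++k)
        acc += a[i][k] * bt[j][k];
    return acc;
}

// Acc-Indexer: the same loop through the indexers, called via the
// base class.
double DotIndexer(const MatrixBase& a, const MatrixBase& bt,
                  std::size_t i, std::size_t j, std::size_t n) {
    double acc = 0.0;
    for (std::size_t k = 0; k < n; ++k)
        acc += a.GetValue(i, k) * bt.GetValue(j, k);
    return acc;
}

// Acc-Template: templated on the matrix type, so the compiler knows
// the concrete type at each call site and can in principle inline the
// indexers away.
template <typename TMatrix>
double DotTemplate(const TMatrix& a, const TMatrix& bt,
                   std::size_t i, std::size_t j, std::size_t n) {
    double acc = 0.0;
    for (std::size_t k = 0; k < n; ++k)
        acc += a.GetValue(i, k) * bt.GetValue(j, k);
    return acc;
}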

Other multiplication and storage methods showed similar differences. C# seemed to do well inlining the indexers, but then had an optimizer issue implementing the generics.

VC++ seemed to have issues inlining the indexers, but then did an OK job implementing the templates. Why? Is this a compiler-specific issue? I hope to test this in G++ and ICL, though it is low on my priority list.

Since templates are not critical to the core sections of my code, I will not be going through the generated assembly. Templates work well in the non-critical sections: I recommend them for simplifying tests and for the routines that organize benchmarking. But I can't recommend them for performance-critical code, at least not under Visual C++.

These performance differences surprise me and could be my fault. If you have a suggestion to improve template performance, please leave a comment.