Compiler Optimizations

In Implementing a High Performance Matrix-Matrix Block Multiply (IaHPM), I listed the basic block multiplication code and said:

“While this code is easy to understand, it relies on the compiler to treat the matrices as blocks. Many if not most compilers do this poorly.”

To demonstrate how well compilers optimize matrix-matrix multiplication codes, including basic block multiplication, I benchmarked four versions of my matrix-matrix multiplication test bed. Under Linux (Fedora Core 18), I compiled the C++ version using GNU’s G++ 4.7 and Intel’s ICPC 13.1.1 from Composer XE 2013 for Linux. Under Windows 7 SP1, I compiled the Visual C++ and C# versions using Visual Studio 2012.

Both operating systems were fully updated when I ran the benchmarks, and both ran on the same bare hardware. ICPC and G++ used the -O3 flag. Visual C++ used the /O2 flag, with no architecture-specific optimizations specified. C# was compiled with optimization enabled. All of these compilers can use SIMD instructions to a greater (ICPC) or lesser (C#) extent. Given the test results, I suspect ICPC was using SIMD in a couple of test cases. I may revisit these tests in the future with SIMD explicitly disabled.

The tests

I benchmarked three matrix sizes: 1000 x 1000, 5000 x 5000, and 15000 x 15000. For the 1K and 5K tests, I used three element storage types: a one-dimensional array (1D), an array of arrays (AA), and an array of arrays of 1024-element one-dimensional arrays (FM).
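In C#, the three storage types look roughly like this. This is a sketch assuming a matrix dimension n; in particular, the chunking scheme I show for FM is my assumption based on the description above, not necessarily the test bed's exact layout.

// 1D: row-major; element (i, j) is oneD[i * n + j]
double[] oneD = new double[n * n];

// AA: jagged array of rows; element (i, j) is aa[i][j]
double[][] aa = new double[n][];
for (int i = 0; i < n; i++)
    aa[i] = new double[n];

// FM (assumed layout): each row split into 1024-element chunks;
// element (i, j) is fm[i][j / 1024][j % 1024].
// The last chunk of a row may be only partially used.
double[][][] fm = new double[n][][];
for (int i = 0; i < n; i++)
{
    int chunks = (n + 1023) / 1024;
    fm[i] = new double[chunks][];
    for (int k = 0; k < chunks; k++)
        fm[i][k] = new double[1024];
}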

For 1D and AA, I ran three multiplication methods: Transpose, where the second matrix is transposed and a modified multiplication algorithm is used; Block, using the basic block multiply code presented in IaHPM; and Transpose Block, where the second matrix is transposed and a modified block multiply code is used. The FM multiplication used an algorithm similar to the modified block multiply algorithm presented at the end of IaHPM. For the 15K tests, I only ran the fastest method for each of the Windows compilers.
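As a sketch of the Transpose method on 1D storage (my code, not necessarily the test bed's): transposing the second matrix turns its strided column walk into a sequential row walk, so the inner loop reads both operands contiguously.

// Transpose multiply for 1D row-major storage.
// bT is the transpose of b, so bT[j * n + k] == b[k * n + j].
static double[] MultiplyTranspose(double[] a, double[] bT, int n)
{
    var c = new double[n * n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
        {
            double sum = 0.0;                         // accumulator
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * bT[j * n + k];  // two sequential streams
            c[i * n + j] = sum;
        }
    return c;
}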

All algorithms used accumulators. Block routines that use MIN had the MIN values computed outside of the for-loop tests (hoisted, as sketched below). The FM routines had the inner loop unrolled. Values slightly over 100% may be due to inaccurate CPU speed values; values significantly over 100% may be due to SIMD instruction use.
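For example, a basic block multiply with accumulators and hoisted MINs looks roughly like this. This is a sketch with an assumed blockSize parameter, not the exact IaHPM listing.

// Blocked multiply for 1D row-major storage; c is assumed to start zeroed.
// Each Math.Min is computed once per block, not inside a for-loop test.
static void MultiplyBlock(double[] a, double[] b, double[] c, int n, int blockSize)
{
    for (int ii = 0; ii < n; ii += blockSize)
    {
        int iMax = Math.Min(ii + blockSize, n);   // hoisted MIN
        for (int kk = 0; kk < n; kk += blockSize)
        {
            int kMax = Math.Min(kk + blockSize, n);
            for (int jj = 0; jj < n; jj += blockSize)
            {
                int jMax = Math.Min(jj + blockSize, n);
                for (int i = ii; i < iMax; i++)
                    for (int j = jj; j < jMax; j++)
                    {
                        double sum = c[i * n + j];    // accumulator
                        for (int k = kk; k < kMax; k++)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
            }
        }
    }
}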

1000 x 1000

[Chart: 1000 x 1000 benchmark results]

The 1000 x 1000 case is somewhat trivial: all of the elements necessary for a dot product fit into the L1 cache. ICPC does very well at optimizing the transpose methods. G++ and VC++ recognize the block code: their block routines are nearly the same speed as the basic transpose method. C#, by contrast, did not optimize the block methods very well.

5000 x 5000

[Chart: 5000 x 5000 benchmark results]

ICPC optimized the transpose block methods well. However, G++ and VC++ take the lead using the specialized FM data structure and routine.

15000 x 15000

[Chart: 15000 x 15000 benchmark results]

ICPC is still in third place, but not by much. C# makes a surprisingly strong showing. I may update this chart with the VC++ and C# values for the other multiplication methods in the future.

Thoughts

Is creating specialized data structures and calculation methods worth the effort? It depends on the situation. For matrix-matrix multiplication, probably not: ICPC did a great job on the transpose block algorithm, which works on a plain one-dimensional array, so many standard libraries can still be used.

Note: Please do not use your own matrix-matrix multiplication routines in a production system without an amazingly good reason to do so. Use ATLAS, Intel’s MKL, cuBLAS, Math.NET, or some other BLAS implementation.

The critical question when it comes to optimizing compilers is: how easy is it to get an optimizing compiler to optimize your code? Compilers only optimize patterns they recognize. If a compiler optimizes your code as written, that’s great. If it doesn’t, is it better to spend time working with the compiler or creating better data structures and code?

There isn’t an easy answer to this question. If your routine has to work across multiple compilers, it may be better to work on the implementation details. If you know you’ll be using an optimizing compiler on a regular basis, it may be better to learn the intricacies of that compiler. If processing speed is the competitive advantage for your research or business, it may be best to do both.

To get an optimizing compiler to optimize a piece of code, the code may need to be written in a specific way. Additionally, optimal performance may require testing different combinations of compiler flags. As the above results show, simply feeding an existing piece of code to an optimizing compiler with a “normal” set of flags does not guarantee good results.

C#, Generics, and Optimizers

Generics are a handy feature of many languages. C# has had them since version 2.0. In some circumstances, their performance impact is minimal. In others, using them can cause significant slowdowns. Here is an example of the latter.

My matrix interface requires an indexer. Indexers can adversely affect performance, so my matrix classes have indexer versions of each of the multiplication methods for benchmarking purposes. I kept these versions per class, even though the code could have been shared, as I suspected there would be a performance price for using generics.
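The indexer in question is an ordinary two-argument C# indexer, roughly like the following; the field name data is my placeholder for whatever backing storage the class uses (e.g., double[][] for MatrixAA).

// Two-argument indexer over the class's backing storage.
public double this[int row, int col]
{
    get { return data[row][col]; }
    set { data[row][col] = value; }
}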

Having determined the performance cost of indexers, I decided to convert the methods to generics. I thought the additional cost of generics would be fairly low and it would be handy to easily multiply different matrix classes. The methods went from this format:

public MatrixAA Multiply(MatrixAA multiplier) { .... }

to this:

public static U Multiply<U, V>(U matrixA, V matrixB)
    where U : IMatrix<U>, new()
    where V : IMatrix<V> { .... }

The interface itself uses generics because the required methods must return their own type. The new() constraint is required on U so the product matrix can be created.
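Putting that together, the interface looks something like this. This is my reconstruction from the description above; the exact member list beyond the indexer and the self-returning methods is a guess.

// Self-referential generic interface: implementers declare
// class MatrixAA : IMatrix<MatrixAA>, so methods return the concrete type.
public interface IMatrix<T>
{
    double this[int row, int col] { get; set; }  // the required indexer
    int Rows { get; }
    int Cols { get; }
    T Transpose();                               // methods return their own type
}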

The performance hit was incredible: the above generic is 82% slower than the non-generic version, and the parallel version is 65% slower. More sophisticated multiplication methods show similar degradation.

Ildasm’s disassemblies of the generic and non-generic versions of the method suggest the issue is either the use of an interface or the use of a generic interface. The non-generic version compiles to late-bound calls to MatrixAA.get_Item(). The generic version compiles to late-bound calls to IMatrix<U>.get_Item() and IMatrix<V>.get_Item().

The speed difference could be the overhead of dispatching from IMatrix<U>.get_Item() to MatrixAA.get_Item() at run time, or the JIT optimizer may have trouble optimizing the generic version of the call. Either way, the performance loss is too great for me to use the generic versions: the indexer methods need to show the cost of using an indexer, not the cost of using generics in this situation.