C++, Templates, and Optimizers

Using C# generics can cause performance issues. When I started porting my matrix-matrix multiplication test bed to C++, I was curious if templates provided better performance. I had hoped C++’s compiled nature and more robust optimizer would make templates run nearly as fast as regular code.

But they don’t. Additionally, the performance hit is split differently than in C#. In C#, most of the performance loss is in going from the indexer method to the generic method. In C++, most of the performance loss is in going from the regular method to the indexer method.

My C++ matrices all inherit from the MatrixBase virtual class. Three indexer methods, GetValue, SetValue, and AddValue, allow for generic matrix-matrix multiplication routines. All three are implemented as inline methods in every class header file.

The execution time of multiplying two 1000 x 1000 double-precision matrices 20 times is disappointing. The code was compiled in Visual C++ under Visual Studio 2012 with /Ob2, /O2, /Oi, and /Ot. All listed methods use accumulators, store data in arrays of arrays, and run in parallel across four cores. The figures shown are elapsed time in seconds for the run.

Acc            5.361
Acc-Indexer   29.939
Acc-Template  40.211

Other multiplication and storage methods had similar differences. C# seemed to do well inlining the indexers then had an optimizer issue implementing the generics.

VC++ seemed to have issues inlining the indexers, then did an OK job implementing the templates. Why? Is this a compiler-specific issue? I hope to test this in g++ and ICL, though this is low on my priority list.

Since templates are not critical to the core sections of my code, I will not be going through the generated assembly code. Templates work well in the non-critical sections. I recommend using templates to simplify testing and in the routines that organize benchmarking. But I can’t recommend them for performance-critical applications, at least not using Visual C++.

These performance differences surprise me and could be my fault. If you have a suggestion to improve template performance, please leave a comment.

C#, Generics, and Optimizers

Generics are a handy feature of many languages. C# has had them since version 2.0. In some circumstances, their performance impact is minimal. In others, using them can cause significant slowdowns. Here is an example of the latter.

My matrix interface requires an indexer. Indexers can adversely effect performance, so my matrix classes have indexer versions of each of the multiplication methods for benchmarking purposes. This was done even though the code is shared, as I suspected there would be a performance price for using generics.

Having determined the performance cost of indexers, I decided to convert the methods to generics. I thought the additional cost of generics would be fairly low and it would be handy to easily multiply different matrix classes. The methods went from this format:

public MatrixAA Multiply(MatrixAA multiplier) { .... }

to this:

public static U Multiply<U, V>(U matrixA, V matrixB)
    where U:IMatrix<U>, new()
    where V:IMatrix<V> { .... }

The interface itself uses generics as the required methods must return their own type. New() is required for U so the product matrix can be created.

The performance hit was incredible. The above generic is 82% slower than the non-generic. The parallel version is 65% slower. More sophisticated multiplication methods show similar degradation.

Ildasm‘s disassemblies of the generic and non-generic versions of the method imply the issue is either the use of an interface or the use of a generic interface. The non-generic version compiles to late-bound calls to MatrixAA.get_Item(). The generic version compiles to late-bound calls to IMatrix<U>.get_Item() and IMatrix<V>.get_Item().

The speed difference could be the overhead of getting from IMatrix<U>.get_Item() to MatrixAA.get_Item() at run-time. Or the JIT optimizer has trouble optimizing the generic version of the call. Either way, the performance loss is too great for me to use the generic versions. The indexer methods need to show the cost of using an indexer, not the cost of using generics in this situation.