Two additions

A busy summer and an even busier fall left little time for writing. I have a better-defined schedule now, one that allows more time for hobbies. My two main passions are computing and cycling, and I am putting more time into both. I also plan on writing about each of them.

While all posts will still live in the Blog section, I’ve added two new menu items: Computing and Cycling. If you are only interested in posts about computing, select the Computing menu option. If you are only interested in Cycling, select that menu option. If there is interest, I will look into creating separate RSS feeds.

No existing URLs have changed. Articles I feel are of special note will still be highlighted on the front page of the site.

Compiler Optimizations

In Implementing a High Performance Matrix-Matrix Block Multiply (IaHPM), I listed the basic block multiplication code and said:

While this code is easy to understand, it relies on the compiler to treat the matrices as blocks. Many if not most compilers do this poorly.
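For reference, here is a minimal sketch of the kind of basic block multiply that quote refers to, written against row-major, one-dimensional arrays. This is my own illustrative C++, not the listing from IaHPM, and the function and parameter names are mine.

#include <algorithm>
#include <vector>

// Basic block multiply over row-major N x N matrices stored in
// one-dimensional arrays. c must be zero-initialized by the caller.
void BlockMultiply(const std::vector<double>& a,
                   const std::vector<double>& b,
                   std::vector<double>& c,
                   int n, int blockSize)
{
    for (int ii = 0; ii < n; ii += blockSize)
        for (int kk = 0; kk < n; kk += blockSize)
            for (int jj = 0; jj < n; jj += blockSize)
                // Multiply the (ii,kk) block of A by the (kk,jj) block of B
                // and accumulate the result into the (ii,jj) block of C.
                for (int i = ii; i < std::min(ii + blockSize, n); ++i)
                    for (int k = kk; k < std::min(kk + blockSize, n); ++k)
                    {
                        const double aik = a[i * n + k];
                        for (int j = jj; j < std::min(jj + blockSize, n); ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}

The loops are written in blocks, but the elements of each block are still scattered across a large row-major array, which is why this style of code depends so heavily on the compiler.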

To demonstrate how well compilers optimize matrix-matrix multiplication codes, including basic block multiplication, I benchmarked four versions of my matrix-matrix multiplication test bed. Under Linux (Fedora Core 18), I compiled the C++ version using GNU’s G++ 4.7 and Intel’s ICPC 13.1.1 from Composer XE 2013 for Linux. Under Windows 7 SP1, I compiled the Visual C++ and C# versions using Visual Studio 2012.

Both operating systems were fully updated at the time of the benchmarks and ran on the same bare hardware. ICPC and G++ used the -O3 flag. Visual C++ used the /O2 flag, with no architecture-specific optimizations specified. The C# build had optimizations enabled. All of these compilers can use SIMD instructions to a greater (ICPC) or lesser (C#) extent, and given the test results, I suspect ICPC was using SIMD in a couple of test cases. I may revisit these tests with SIMD explicitly turned off in the future.

The tests

I benchmarked three matrix sizes: 1000 x 1000, 5000 x 5000, and 15000 x 15000. For the 1K and 5K tests, I used three element storage types: a one-dimensional array (1D), an array of arrays (AA), and an array of arrays of 1024-element one-dimensional arrays (FM).

For 1D and AA, I ran three multiplication methods: Transpose, where the second matrix is transposed and a modified multiplication algorithm is used; Block, which uses the basic block multiply code presented in IaHPM; and Transpose Block, where the second matrix is transposed and a modified block multiply code is used. The FM multiplication used an algorithm similar to the modified block multiply algorithm presented at the end of IaHPM. For the 15K tests, I only ran the fastest method for each of the Windows compilers.

All algorithms used accumulators. The block routines that use MIN had the MIN calculated outside of the for-loop condition. The FM routines had the inner loop unrolled. Values slightly over 100% may be due to inaccurate CPU speed values; values significantly over 100% may be due to SIMD instruction use.
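To make the setup concrete, here is a minimal sketch of the Transpose method with an accumulator; this is illustrative C++ of my own, not the test bed code. The block routines additionally hoist the MIN bound calculation out of the loop condition (for example, computing const int iMax = std::min(ii + blockSize, n); once before the inner loops).

#include <vector>

// Transpose method: B is transposed up front so both operands of every
// dot product are walked in row-major order. c must be sized n * n.
void TransposeMultiply(const std::vector<double>& a,
                       const std::vector<double>& b,
                       std::vector<double>& c, int n)
{
    std::vector<double> bt(static_cast<size_t>(n) * n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
            double acc = 0.0;                          // accumulator
            for (int k = 0; k < n; ++k)
                acc += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = acc;
        }
}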

1000 x 1000

[Chart: 1000 x 1000 results by compiler, storage type, and multiplication method]

The 1000 x 1000 case is somewhat trivial: all of the elements necessary for a dot product will fit into the L1 cache. ICPC does very well at optimizing the transpose methods. Note that C# did not optimize the block methods very well. G++ and VC++ recognize the block code: the block routines are nearly the same speed as the basic transpose method.

5000 x 5000

[Chart: 5000 x 5000 results by compiler, storage type, and multiplication method]

ICPC optimized the transpose block methods well. However, G++ and VC++ take the lead using the specialized FM data structure and routine.

15000 x 15000

[Chart: 15000 x 15000 results by compiler and multiplication method]

ICPC is still in third place, but not by much. C# makes a surprisingly strong showing. I may update this chart with the VC++ and C# values for the other multiplication methods in the future.

Thoughts

Is creating specialized data structures and calculation methods worth the effort? It depends on the situation. For matrix-matrix multiplication, probably not. ICPC did a great job on the transpose block algorithm. This allows the use of a one-dimensional array, which allows many standard libraries to be used.

Note: Please do not use your own matrix-matrix multiplication routines in a production system without having an amazingly good reason to do so. Use ATLAS, Intel’s MKL, cuBLAS, Math.Net, or some other BLAS implementation.

The critical question when it comes to optimizing compilers is: how easy is it to get an optimizing compiler to optimize your code? Compilers only optimize situations they recognize. If a compiler optimizes your code as written, that's great. If it doesn't, is it better to spend time working with the compiler or creating better data structures and code?

There isn’t an easy answer to this question. If your routine has to work across multiple compilers, it may be better to work on the implementation details. If you know you’ll be using an optimizing compiler on a regular basis, it may be better to learn the intricacies of that compiler. If processing speed is the competitive advantage for your research or business, it may be best to do both.

To get an optimizing compiler to optimize a piece of code, the code may need to be written in a specific way. Additionally, optimal performance may require testing different combinations of compiler flags. As the above results show, just putting an existing piece of code into an optimizing compiler using a “normal” set of compiler flags does not guarantee good results.

C++, Templates, and Optimizers

Using C# generics can cause performance issues. When I started porting my matrix-matrix multiplication test bed to C++, I was curious whether templates would provide better performance. I had hoped C++'s compiled nature and more robust optimizer would make templates run nearly as fast as regular code.

But they don’t. Additionally, the performance hit is split differently than in C#. In C#, most of the performance loss is in going from the indexer method to the generic method. In C++, most of the performance loss is in going from the regular method to the indexer method.

My C++ matrices all inherit from MatrixBase, an abstract base class. Three indexer methods, GetValue, SetValue, and AddValue, allow for generic matrix-matrix multiplication routines. All three are implemented as inline methods in each class's header file.
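A hypothetical sketch of what that hierarchy and the indexer- and template-based multiply routines might look like follows. Apart from MatrixBase and the three indexer names, all class, function, and parameter names are my own, and the real test bed code may be structured differently.

#include <vector>

// Base class exposing the three indexer methods.
class MatrixBase
{
public:
    virtual ~MatrixBase() = default;
    virtual double GetValue(int row, int col) const = 0;
    virtual void   SetValue(int row, int col, double value) = 0;
    virtual void   AddValue(int row, int col, double value) = 0;
};

// Array-of-arrays storage; the indexers are implemented inline in the header.
class ArrayOfArraysMatrix : public MatrixBase
{
public:
    ArrayOfArraysMatrix(int rows, int cols)
        : data(rows, std::vector<double>(cols, 0.0)) {}

    double GetValue(int row, int col) const override { return data[row][col]; }
    void   SetValue(int row, int col, double value) override { data[row][col] = value; }
    void   AddValue(int row, int col, double value) override { data[row][col] += value; }

private:
    std::vector<std::vector<double>> data;
};

// Indexer-based multiply (square matrices): works for any MatrixBase, but
// every element access is a virtual call the optimizer may not remove.
void MultiplyAccIndexer(const MatrixBase& a, const MatrixBase& b,
                        MatrixBase& c, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
            double acc = 0.0;
            for (int k = 0; k < n; ++k)
                acc += a.GetValue(i, k) * b.GetValue(k, j);
            c.SetValue(i, j, acc);
        }
}

// Template-based multiply: the concrete matrix type is known at compile
// time, so in principle the indexer calls can be inlined away.
template <typename TMatrix>
void MultiplyAccTemplate(const TMatrix& a, const TMatrix& b, TMatrix& c, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
            double acc = 0.0;
            for (int k = 0; k < n; ++k)
                acc += a.GetValue(i, k) * b.GetValue(k, j);
            c.SetValue(i, j, acc);
        }
}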

The execution time of multiplying two 1000 x 1000 double-precision matrices 20 times is disappointing. The code was compiled in Visual C++ under Visual Studio 2012 with /Ob2, /O2, /Oi, and /Ot. All listed methods use accumulators, store data in arrays of arrays, and run in parallel across four cores. The figures shown are elapsed time in seconds for the run.

Acc            5.361
Acc-Indexer   29.939
Acc-Template  40.211

Other multiplication and storage methods showed similar differences. C# seemed to do well inlining the indexers, then had an optimizer issue implementing the generics.

VC++ seemed to have issues inlining the indexers, then did an OK job implementing the templates. Why? Is this a compiler-specific issue? I hope to test this in g++ and ICL, though this is low on my priority list.

Since templates are not critical to the core sections of my code, I will not be going through the generated assembly code. Templates work well in the non-critical sections. I recommend using templates to simplify testing and in the routines that organize benchmarking. But I can’t recommend them for performance-critical applications, at least not using Visual C++.

These performance differences surprise me and could be my fault. If you have a suggestion to improve template performance, please leave a comment.

Basic Cache Optimization – Performance Comparisons

Benchmarking Software

I have written a matrix-matrix multiplication test bed; the source code is available on GitHub. It explores far more than what was covered in this series, including the impact of accumulators, indexers, parallelism, underlying data structures, and so forth. The test bed does manage to approach maximum theoretical, non-SIMD CPU performance. The code is in C# but converts easily to C++, and the concepts are applicable to most languages. If there is enough demand, I will release a C++ version of the test bed. The solution loads and compiles under Xamarin Studio and runs under Mono.

Test System

The test system is a 3.4 GHz Core i7-2600K “Sandy Bridge” CPU with 16 GB of RAM. Turbo Boost was disabled to allow a more accurate estimate of the potential performance of the system. Hyper-Threading was also disabled. Only the operations required for multiplication were measured. Creating and filling test matrices with random data was not included in the performance figures. For methods that required a matrix transposition, the transposition time was included. All matrix elements are double-precision (64-bit) floating point values.

Performance Figures

The performance figures compare a straightforward matrix-matrix multiplication routine to a transposition routine and a block multiplication routine. The latter two correspond to the first and third optimizations. All routines use accumulators.

I have not written a version of the second optimization for reasons beyond the scope of this article. The first two routines use an array of arrays to hold the matrix elements. The third uses an array of arrays of 1024-element one-dimensional arrays, the equivalent of a 32 x 32 block size.

Performance figures shown are a percentage of the test system's theoretical maximum possible MFLOPS without SIMD. The theoretical maximum is approximately 3400 MFLOPS for single-core routines (one floating point operation per clock cycle at 3.4 GHz) and 13600 MFLOPS for four-core routines. Both values were treated as exact for calculation purposes.

The MFLOPS of each operation was calculated using the average time from 100 runs for 1000 x 1000 matrices, 20 runs for 5000 x 5000, and 5 runs for 10000 x 10000 and 15000 x 15000. The number of floating point operations was calculated using the minimum possible for an N x N square matrix: 2 * N^3 - N^2. This slightly understates the actual number of operations performed, and thus the actual MFLOPS.
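That operation count follows from each of the N^2 output elements requiring N multiplications and N - 1 additions:

N^2 * (N + (N - 1)) = 2 * N^3 - N^2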

1000 x 1000

[Chart: 1000 x 1000 performance as a percentage of theoretical maximum MFLOPS]

Scaling

Given the performance improvement at 1000 x 1000, the complexity of the block algorithm compared to the transpose algorithm may not seem to justify the time investment. Once larger matrices are multiplied, however, the benefits become apparent.

[Chart: performance at larger matrix sizes as a percentage of theoretical maximum MFLOPS]

SIMD

For comparison purposes, here is Math.Net's Intel MKL-based matrix-matrix multiplication versus the block algorithm. This chart shows MFLOPS, not percentage of maximum performance. Both algorithms use all four cores. The theoretical maximum of the test machine using SIMD on four cores is 109 GFLOPS (presumably 3.4 GHz x 4 cores x 8 double-precision floating point operations per cycle with AVX).

[Chart: Math.Net (Intel MKL) versus the block algorithm, in MFLOPS, on four cores]


Basic Cache Optimization – Part 4

Guidelines

The first three parts of this series walked through optimizing dot products and matrix-matrix multiplication for the cache. The general guidelines for cache optimization are partitioning, proximity, load speed, and load frequency.

Partitioning

The calculation must be divisible into subcalculations. Both the dot product and matrix-matrix multiplication can be divided: a dot product can be computed in pieces, and matrix-matrix multiplication can be done by multiplying submatrices.

The data must be divisible into blocks, sized so that as many as are needed fit into cache simultaneously. Both examples were readily divisible: the dot products were divided into three-element blocks and the matrices into 3 x 3 submatrices. Two three-element dot product blocks fit into cache at once, as did two 3 x 3 submatrices.
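As an illustration, a dot product partitioned into fixed-size blocks might look like the following sketch (my own C++; the block size of three matches the example, not any real cache).

#include <vector>

// Dot product computed block by block; each pair of blocks is small
// enough to sit in cache together. Assumes x and y have the same length.
double BlockedDotProduct(const std::vector<double>& x,
                         const std::vector<double>& y)
{
    const int blockSize = 3;                      // three-element blocks
    const int n = static_cast<int>(x.size());
    double total = 0.0;
    for (int start = 0; start < n; start += blockSize)
    {
        const int end = (start + blockSize < n) ? start + blockSize : n;
        double partial = 0.0;                     // partial dot product for this block
        for (int k = start; k < end; ++k)
            partial += x[k] * y[k];
        total += partial;
    }
    return total;
}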

Proximity

Partitioning the data into blocks isn't enough: the elements needed for each subcalculation must all lie within the same block. Both examples had good proximity. Each three-element dot product block contained all the elements needed for its part of the dot product, and each 3 x 3 submatrix contained all of the elements necessary for its part of the multiplication.

Load Speed

Moving the data from the blocks into cache must be fast. This can be done in two ways. The data can be stored in RAM in structures designed so the memory manager automatically loads a block into cache. This requires using specialized data structures and may cause problems when moving data from one system to another.

The alternative is to copy a block’s elements from RAM into a special data structure the memory manager automatically loads into cache. Writing a high speed data copy routine is not trivial. Many languages do not have the features necessary to write one. Writing the routine in one language and calling it from another involves overhead. This overhead may slow the call down more than writing it in another language sped it up.

The dot product example had great load speed. The elements were in one-dimensional arrays. The memory manager automatically loaded the elements into cache in three element blocks.

The matrix-matrix multiplication calculation, as written, has horrible load speed. While the code may treat the data as submatrices, the elements in RAM are still in a giant, row-major, one-dimensional array. Either the data structure must be modified to store the elements as actual submatrices, or special code is needed to copy the submatrix elements from RAM into cache. Please see Implementing a High Performance Matrix-Matrix Block Multiply for one way of storing the matrix elements in submatrices.
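For illustration, the "special code" alternative, copying one submatrix out of a row-major array into a small contiguous buffer, could look like this sketch (my own C++, not the IaHPM implementation; it assumes the block lies entirely inside the matrix).

#include <vector>

// Copy the blockSize x blockSize submatrix whose top-left element is at
// (blockRow, blockCol) out of a row-major n x n matrix into a small
// contiguous buffer that loads into cache quickly.
void CopyBlock(const std::vector<double>& matrix, int n,
               int blockRow, int blockCol, int blockSize,
               std::vector<double>& block)
{
    block.resize(static_cast<size_t>(blockSize) * blockSize);
    for (int i = 0; i < blockSize; ++i)
        for (int j = 0; j < blockSize; ++j)
            block[i * blockSize + j] = matrix[(blockRow + i) * n + (blockCol + j)];
}

If a copy like this runs inside the inner loops of the multiplication and is slow, its cost can outweigh the cache savings, which is the trade-off described above.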

Load Frequency

The final piece is ensuring the blocks as a group are loaded into cache as few times as possible. This requires looking at the big picture. Minimizing loads for a part of the calculation may prevent the calculation as a whole from being as fast as possible.

The dot product calculation automatically had the lowest possible load frequency. Each three element block was only needed once and was only loaded into cache once. Unfortunately, knowing how to optimize one part of a calculation may result in suboptimal performance for the calculation as a whole.

In the first matrix-matrix multiply, using the optimized dot product routine resulted in inefficient cache use. Each block of a row of A was needed multiple times and was loaded multiple times. The second optimization changed the order in which matrix-matrix multiplication was calculated, which resulted in each block of A being loaded only once. The third optimization also did this.

Conclusion

The only way to approach the maximum calculation potential of a CPU is by optimizing the calculation for the cache. The high clock speeds of modern CPUs have made RAM access prohibitively expensive. Even non-SIMD calculations written without cache optimization run more than 10x slower than a cache-optimized version.

Running an inefficient calculation on many CPUs is often considered a better solution than cache optimization. While this may raise throughput, latency remains an issue and increases the cost of mistakes. Unknowingly tainting a 10-minute calculation costs 10 minutes; unknowingly tainting just one of ten 100-minute calculations costs 100 minutes.

SIMD optimization is predicated on cache optimization. The pinnacle of CPU performance is out of reach for those who do not optimize for the cache. The difference in performance between a SIMD-optimized calculation and one not optimized for the cache can be a factor of 100 or more: the difference between taking over eight hours and taking five minutes.

Concerns

Following the guidelines above may result in massive speed improvements. If a programmer plans on using SIMD, cache optimization is the first step. However, cache optimization has some perils.

Before doing any cache optimization, a comprehensive test system must be available. If a tested, known-good reference routine is not available, start with the simplest implementation of the calculation. Once that is well-tested, use it as a validator. Cache optimization often involves data structures that create fencepost and/or symmetry problems. Tests must cross over partition boundaries, use non-integer multiples of block sizes, and so forth.

Partitioning schemes must take the nature of the data into account. Does the data naturally partition itself into the chosen block size? Will it continue to do so? Does partitioning happen invisibly to the end programmer, or do they have to code for it? The answers to these questions can limit the portability and lifespan of the code.

Cache-optimized calculations will be mathematically identical to the original calculation, but they may not perform the operations in the same order or in exactly the same number. When working with floating point values, this may result in small differences between the two calculations. Additionally, care must be taken to avoid significant-digit problems. If the code must match a reference calculation exactly, cache optimizations may be limited or impossible.

Intentionally picking the order of operations in a calculation allows the best optimization the programmer is capable of envisioning. Unfortunately, this increases development time and limits the optimization to what the programmer can conceptualize. Whether this level of optimization is better or worse than what the compiler would produce depends on the compiler and the programmer. Additionally, some compilers may produce better output than a programmer only when “tricked”. Learning how to “trick” the compiler takes time and these tricks may change between minor releases of the compiler. Whether it is better to optimize the code by hand or play compiler games depends on the programmer, the compiler, and the situation.

Different CPUs use different cache sizes. As of 2013, almost all x86-compatible laptop, desktop, and server CPUs use 32 KB L1 data caches. Many tablet and phone CPUs use 16 KB data caches. Some older ones use 8 KB data caches. Whether this is an issue depends on the purpose of the code.

If data partitioning is done via a custom data structure, custom support routines will be needed. Using generic support routines via indexers or mapping objects may leave the system with one amazingly fast calculation and many more extremely slow ones. Writing and testing custom support routines and converters is non-trivial.

This series deals with optimizing for the L1 data cache. When writing code destined for SIMD optimization, this may not be enough. Optimizing beyond the L1 cache is beyond the scope of this article. To get an idea of what is involved, please follow the links from Kazushige Goto’s Wikipedia entry.

Please continue on to the performance comparisons.

Basic Cache Optimization – Part 3

This is part three in a series on basic cache optimization. Please start with part 1.

The second optimization improved the speed of the overall calculation. But, like the first optimization, it required transposing the second matrix. Transposition requires time and memory. However, dividing the rows into blocks was a good start.

Third Optimization

In order to explain the third optimization, we need a cache more like one used in a real CPU. The cache now holds eighteen elements. The memory manager now pulls in nine elements at a time. The example matrices are the same. On paper:

[Figure: the example matrices A and B on paper]

and in RAM:

[Figure: A and B as laid out in RAM]

In the second optimization, each row was divided into two halves. Each half was used to calculate a partial dot product. As the calculation continued, the partial dot products were summed. Once all of the partial dot products were calculated and summed, C = A * B was complete.

Dividing the rows into halves divided, or partitioned, them into blocks. In this case, each block was one row high by three columns wide. While BT wasn’t explicitly partitioned by the code, the memory manager effectively partitioned BT by loading three elements of each row at a time.

A larger cache can hold more elements. The 18 element cache could be used to hold an entire row of A and BT. But that still requires transposing B to create BT. It would be better if B didn’t need to be transposed in order to use the cache effectively. The two matrices are once again A and B, both on paper:

[Figure: A and B on paper]

and in RAM:

[Figure: A and B as laid out in RAM]

The second optimization divided each row into three element blocks. This was possible because a dot product can be calculated in parts. However, a row divided into blocks is also a matrix. A three element block is a 1 x 3 matrix.

This means A and B can be divided into sizes other than 1 x 3. The formula for multiplying a matrix that has been divided into blocks is the same as the normal matrix-matrix multiplication formula. The only difference is that the elements of A, B, and C are now matrices. Partitioning A and B into 3×3 submatrices:

[Figure: A and B partitioned into 3 x 3 submatrices]

turns them into a 3 x 2 matrix and a 2 x 2 matrix:

[Figure: the partitioned A as a 3 x 2 matrix of blocks and B as a 2 x 2 matrix of blocks]

Since multiplying two matrices that are composed of submatrices follows the normal multiplication rules, A * B = C looks like:

[Figure: A * B = C written in terms of the submatrix blocks]

Following the rules, C[0,0] = A[0,0] * B[0,0] + A[0,1] * B[1,0]. Matrix multiplication has already been defined, so calculating A[0,0] * B[0,0] and A[0,1] * B[1,0] is simple. Adding two matrices requires adding corresponding elements, which is also trivial.
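More generally, every block of C for this partitioning is given by the same formula with submatrices as the elements:

C[I,J] = A[I,0] * B[0,J] + A[I,1] * B[1,J],   for I = 0, 1, 2 and J = 0, 1

where each product is a 3 x 3 matrix multiplication and the sum is element-by-element matrix addition.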

One advantage of this is the submatrix size can be tailored to the cache size. Two 3 x 3 matrices will fit into an eighteen element cache. This makes multiplying two 3 x 3 matrices very fast, much like dividing rows into three element pieces made calculating dot products very fast. 

Another advantage is the order of operations can be tailored to use the cache efficiently. Just like the second optimization's dot products, the submatrix multiplications can be partially calculated. Instead of loading A[0,0] and B[0,0] into cache and multiplying them, then loading A[0,1] and B[1,0] to finish C[0,0], the code can load A[0,0] into cache and keep it there. It multiplies A[0,0] by B[0,0] and stores a temporary result, then multiplies A[0,0] by B[0,1] and stores a temporary result. Only then does it move on to A[0,1]. This minimizes how many times each submatrix is loaded into cache.

The final advantage is this simplifies writing a fast matrix-matrix multiplication routine. Instead of trying to come up with code that is fast for any size matrix, time can be spent coming up with an insanely fast routine for multiplying 3 x 3 matrices. Or 10 x 10. Or whatever size fits the cache best. Once this insanely fast routine is complete, any pair of matrices can be partitioned into blocks of this optimal size.
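A sketch of what such a fixed-size kernel could look like follows (my own illustrative C++; the += is what lets the partial submatrix products be summed as the block formula requires).

// Multiply two 3 x 3 submatrices and accumulate the result into c:
// c += a * b. All three are 9-element, row-major arrays.
inline void MultiplyAccumulate3x3(const double* a, const double* b, double* c)
{
    for (int i = 0; i < 3; ++i)
        for (int k = 0; k < 3; ++k)
        {
            const double aik = a[i * 3 + k];
            c[i * 3 + 0] += aik * b[k * 3 + 0];
            c[i * 3 + 1] += aik * b[k * 3 + 1];
            c[i * 3 + 2] += aik * b[k * 3 + 2];
        }
}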

Unfortunately, getting the data in a block from RAM into cache is not trivial. If the data is stored in a standard row-major or column-major array, special code is needed to copy it from the main matrix array in RAM into specialized submatrix arrays that can be loaded into cache. If this code is slow, block multiplication will be slower than transposing the second matrix. Fast array copy code often requires language features that many newer programming languages lack.

The other way of handling this is to store the data in submatrix-major order. This has its own drawbacks that are covered in the conclusion. For an implementation that works in most languages, please see Implementing a High Performance Matrix-Matrix Block Multiply.

Conclusion and concerns are in part four.

Basic Cache Optimization – Part 2

This is part two in a series on basic cache optimization. Please start with part 1.

The first optimization improved the dot product calculation. But the order in which the dot products were calculated resulted in each partial row of A being loaded into cache multiple times. The overall matrix multiplication calculation would run faster if each partial row of A was used as many times as possible before overwriting it in cache.

Second Optimization

A and B are the same as in part one. The cache, CA, still holds six elements. The memory manager retrieves three elements from RAM at a time. As in the first optimization, the calculation has been modified to use transposed B, BT. On paper, the matrices are:

[Figure: A and BT on paper]

In RAM, they are:

[Figure: A and BT as laid out in RAM]

When multiplying two matrices, C = A * B, each element of C, C[i, j], is the dot product of the i-th row of A and the j-th row of BT.

There is no rule that a dot product must be calculated all at once. The formula could break each dot product into sections. C[i, j] is the dot product of the first three element pairs of the i-th row of A and the j-th row of BT plus the dot product of the second three element pairs of the i-th row of A and the j-th row of BT.
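In index form (zero-based, for the six-column matrices in this example), that split is:

C[i,j] = (A[i,0] * BT[j,0] + A[i,1] * BT[j,1] + A[i,2] * BT[j,2])
       + (A[i,3] * BT[j,3] + A[i,4] * BT[j,4] + A[i,5] * BT[j,5])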

In terms of reading data from RAM, the first optimization already did this: each read from RAM brought in three useful elements. Making the calculation explicitly work with three element pairs at a time lets the programmer work better with the cache. Working with three-element pieces, or blocks, of each matrix row lets the programmer control, to an extent, which blocks are kept in cache and for how long.

If the code calculates the dot products of all the first three element pairs, then all of the second three element pairs, the number of RAM accesses is reduced. When C[0,0] is calculated, the first three elements of the first rows of A and BT are loaded into cache and their dot product is calculated. This is the same as in the first optimization:

[Figures: the first three elements of the first rows of A and BT, and the cache after they are loaded]

Instead of continuing on to the next three elements of the first rows of A and BT, the code saves this intermediate dot product. The code then calculates the dot product of the first three elements of the first row of A and the first three elements of the second row of BT.

[Figure: the first three elements of the first row of A and of the second row of BT]

The cache is now:

[Figure: the cache now holds the first three elements of the first row of A and of the second row of BT]

This process is repeated using the first three elements of the remaining rows of BT. The code is now finished with the first three elements of the first row of A. They were loaded into cache once, used, and can be overwritten.

The code then calculates intermediate dot products for the second three elements of the first row of A and the second three elements of each row of BT. These intermediate dot products are added to the ones stored earlier. The result of this set of calculations is the entire first row of C.

This process is repeated for the remaining rows of A. Once these are complete, all of C has been filled in. This change in calculation order yields a large improvement in the number of RAM accesses.
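In code, the loop order just described might look roughly like this sketch (my own illustrative C++ over row-major, one-dimensional arrays, not code from the test bed).

#include <vector>

// Second optimization: for each block of a row of A, every partial dot
// product that needs that block is computed before moving on, so the
// block is loaded into cache once. bt is B transposed (rowsBT rows of
// 'cols' elements) and c is rowsA x rowsBT, zero-initialized.
void MultiplySecondOpt(const std::vector<double>& a,
                       const std::vector<double>& bt,
                       std::vector<double>& c,
                       int rowsA, int rowsBT, int cols, int blockSize)
{
    for (int i = 0; i < rowsA; ++i)                            // each row of A
        for (int start = 0; start < cols; start += blockSize)  // each block of that row
        {
            const int end = (start + blockSize < cols) ? start + blockSize : cols;
            for (int j = 0; j < rowsBT; ++j)                   // each row of BT
            {
                double partial = 0.0;                          // partial dot product
                for (int k = start; k < end; ++k)
                    partial += a[i * cols + k] * bt[j * cols + k];
                c[i * rowsBT + j] += partial;                  // accumulate into C[i,j]
            }
        }
}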

For the example multiplication, A has nine rows and six columns, B has six rows and six columns, and C has nine rows and six columns. The cache, CA, holds six elements and the memory manager loads three elements into cache at a time. Each of the methods requires the following number of RAM accesses.

Method          RAM Accesses
Basic           8 * 54 = 432
First Opt.      4 * 54 = 216
Second Opt.     14 * 9 = 126
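My reading of how those counts break down, counting only loads of A and B/BT elements (intermediate results stay in accumulators):

Basic:        (2 loads for a row of A) + (6 loads for a column of B, one per element) = 8 per element of C;  8 * 54 = 432
First Opt.:   (2 loads for a row of A) + (2 loads for a row of BT) = 4 per element of C;  4 * 54 = 216
Second Opt.:  (2 loads for the blocks of a row of A) + (2 blocks * 6 rows of BT = 12 loads) = 14 per row of A;  14 * 9 = 126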

The first and second optimizations require transposing B. This takes time and uses memory. There is a better way of optimizing matrix multiplication for the cache.

To be continued in part three.