There are three ways of viewing computers:

- Hobby: A computer is an end in itself
- Tool: A computer is an aid to an end
- Hindrance: A computer is a mandatory impediment to an end

Users, developers, software companies, operating systems, and hardware makers each have at least one of these views. Knowing their views can decrease frustration and help achieve your or your business’ goals.

“Computer” refers to both the computer and the operating system. When used without further detail, it can mean either one.

My use of “hobby” and “hobbyist” is not meant as an insult. The terms represent those who take interest in the computer itself. Someone who uses a computer as a tool may also be a hobbyist, but it isn’t required.

The hobby view considers the computer an end in itself. Overclockers, some gamers, some programmers, and some general computer users may enjoy playing with the computer’s hardware and operating system. Their changes may or may not objectively improve their computer experience, but these users enjoy the time they spend changing things. A tool view user may also have the hobby view.

This is a positive view of computers, but not a utilitarian one. Hobbyists’ computers are often overpowered for what they are used for and less stable than a tool user’s computer.

Hobby users generally:

- Want:
- The fastest hardware they can afford
- Operating systems that:
- Provide a challenge or
- Run the software they are most interested in

- Know:
- How to use their computer’s operating system at a deep level
- How to change hardware and software settings

- Upgrade:
- Components when newer, measurably better ones are available
- Computers when their old computer:
- Is significantly slower than newer machines or
- Can no longer host more modern components

- Switch operating systems:
- Frequently: May run multiple operating systems on a regular basis
- Rarely: May be devoted to a single operating system

- Handles problems:
- Well. Enjoys challenges or views them as part of being a hobbyist.

The tool view considers the computer an aid to an end. Writers, artists, educators, engineers, scientists, and consumers may view computers as tools. Each finds themselves more productive, more creative, and/or able to do more complex things than they would without a computer.

This is a positive view of computers. It’s also a utilitarian one. This user selects a computer based on how well it advances the user’s goals considering the user’s current and expected future situations. Many programmers and gamers have the tool view of computers.

Tool users generally:

- Want:
- Reliable computer hardware
- Reliable, non-cryptic computer operating systems

- Know:
- How to use their computer’s operating system as it impacts their goals
- How to use their preferred applications

- Upgrade:
- Components when small needs aren’t being met
- Computers when large needs aren’t being met

- Switch operating systems:
- Rarely. Training costs usually outweigh any other perceived gain

- Handles problems:
- Well. Finds the solution, learns it, and uses it.
- Remembers the problems when making future purchases.

The hindrance view considers computers or operating systems an impediment but is required to use them. An educator who prefers overhead projectors but must use the digital equivalent may consider computers in general an impediment. An accountant may love spreadsheets but find the operating system an impediment.

Hindrance users generally:

- Want:
- The computer to stay out of their way
- Hardware that never fails in any way
- Operating systems that never change

- Know:
- How to use a few specific applications
- This may be at a superficial or deep level
- The educator may know nothing about the digital overhead projector
- The accountant may know everything about Excel

- Very little about the hardware or operating system

- How to use a few specific applications
- Upgrade:
- When forced
- The computer no longer functions
- Their employer replaces the machine

- When forced
- Switch operating systems:
- Never

- Handles problems:
- Poorly. Has not learned to use the machine at a base level, so normal operation is a “problem”

Knowing your view and software and hardware makers’ views is critical for both personal and business computing. A hindrance user shouldn’t buy a computer designed for a hobbyist. A tool-focused business should not use software from hobbyist-oriented projects/companies.

Future articles will go into determining where you are and where others are in more detail.

]]>While all posts will still live in the Blog section, I’ve added two new menu items: Computing and Cycling. If you are only interested in posts about computing, select the Computing menu option. If you are only interested in Cycling, select that menu option. If there is interest, I will look into creating separate RSS feeds.

No existing URLs have changed. Articles I feel are of special note will still be highlighted on the front page of the site.

]]>While this code is easy to understand, it relies on the compiler to treat the matrices as blocks. Many if not most compilers do this poorly.

To demonstrate how well compilers optimize matrix-matrix multiplication codes, including basic block multiplication, I benchmarked four versions of my matrix-matrix multiplication test bed. Under Linux (Fedora Core 18), I compiled the C++ version using GNU’s G++ 4.7 and Intel’s ICPC 13.1.1 from Composer XE 2013 for Linux. Under Windows 7 SP1, I compiled the Visual C++ and C# versions using Visual Studio 2012.

Both operating systems were fully updated when I ran the benchmarks and were run on the same bare hardware. ICPC and G++ used the -O3 flag. Visual C++ used the /O2 flag and architecture-specific optimizations were not specified. C# had optimization on. All of these compilers can use SIMD instructions to a greater (ICPC) or lesser (C#) extent. Given the test results, I suspect ICPC was using SIMD in a couple test cases. I may revisit these tests with SIMD explicitly turned off in the future.

I benchmarked three matrix sizes: 1000 x 1000, 5000 x 5000, and 15000 x 15000. For the 1K and 5K tests, I used three element storage types: one-dimensional array (1D), array of arrays (AA), and array of arrays of 1024 element one-dimensional arrays (FM).

For 1D and AA, I ran three multiplication methods. Transpose, where the second matrix is transposed and a modified multiplication algorithm is used. Block, using the basic block multiply code presented in IaHPM. Transpose Block, where the second matrix is transposed and a modified block multiply code is used. The FM multiplication used an algorithm similar to the modified block multiply algorithm presented at the end of IaHPM. For the 15K tests, I only ran the fastest method for each of the Windows compilers.

All algorithms used accumulators. Block routines that use MIN had the MIN calculated outside of the for loop test. The FM routines had the inner loop unrolled. Values slightly over 100% may be due to inaccurate CPU speed values. Values significantly over 100% may be due to SIMD instruction use.

The 1000 x 1000 case is somewhat trivial: all of the elements necessary for a dot product will fit into the L1 cache. ICPC does very well at optimizing the transpose methods. Note that C# did not optimize the block methods very well. G++ and VC++ recognize the block code: the block routines are nearly the same speed as the basic transpose method.

ICPC optimized the transpose block methods well. However, G++ and VC++ take the lead using the specialized FM data structure and routine.

ICPC is still in third place, but not by much. C# makes a surprisingly strong showing. I may update this chart with the VC++ and C# values for the other multiplication methods in the future.

Is creating specialized data structures and calculation methods worth the effort? It depends on the situation. For matrix-matrix multiplication, probably not. ICPC did a great job on the transpose block algorithm. This allows the use of a one-dimensional array, which allows many standard libraries to be used.

Note: Please do not use your own matrix-matrix multiplication routines in a production system without having an amazingly good reason to do so. Use ATLAS, Intel’s MKL, cuBLAS, Math.Net, or some other BLAS implementation.

The critical question when it comes to optimizing compilers is: how easy is it to get an optimizing compiler to optimize *your* code. Compilers only optimize situations they recognize. If it optimizes your code as written, that’s great. If it doesn’t, is it better to spend time working with the compiler or creating better data structures and code?

There isn’t an easy answer to this question. If your routine has to work across multiple compilers, it may be better to work on the implementation details. If you know you’ll be using an optimizing compiler on a regular basis, it may be better to learn the intricacies of that compiler. If processing speed is *the* competitive advantage for your research or business, it may be best to do both.

To get an optimizing compiler to optimize a piece of code, the code may need to be written in a specific way. Additionally, optimal performance may require testing different combinations of compiler flags. As the above results show, just putting an existing piece of code into an optimizing compiler using a “normal” set of compiler flags does not guarantee good results.

]]>But they don’t. Additionally, the performance hit is split differently than in C#. In C#, most of the performance loss is in going from the indexer method to the generic method. In C++, most of the performance loss is in going from the regular method to the indexer method.

My C++ matrices all inherit from the MatrixBase virtual class. Three indexer methods, GetValue, SetValue, and AddValue, allow for generic matrix-matrix multiplication routines. All three are implemented as inline methods in every class header file.

The execution time of multiplying two 1000 x 1000 double-precision matrices 20 times is disappointing. The code was compiled in Visual C++ under Visual Studio 2012 with /Ob2, /O2, /Oi, and /Ot. All listed methods use accumulators, store data in arrays of arrays, and run in parallel across four cores. The figures shown are elapsed time in seconds for the run.

Acc 5.361 Acc-Indexer 29.939 Acc-Template 40.211

Other multiplication and storage methods had similar differences. C# seemed to do well inlining the indexers then had an optimizer issue implementing the generics.

VC++ seemed to have issues inlining the indexers, then did an OK job implementing the templates. Why? Is this a compiler-specific issue? I hope to test this in g++ and ICL, though this is low on my priority list.

Since templates are not critical to the core sections of my code, I will not be going through the generated assembly code. Templates work well in the non-critical sections. I recommend using templates to simplify testing and in the routines that organize benchmarking. But I can’t recommend them for performance-critical applications, at least not using Visual C++.

These performance differences surprise me and could be my fault. If you have a suggestion to improve template performance, please leave a comment.

]]>I have written a matrix-matrix multiplication testbed. The source code is available on GitHub. It explores far more than what was covered in this series, including the impact of accumulators, indexers, parallelism, underlying data structures, and so forth. The system does manage maximum theoretical, non-SIMD CPU performance. The code is in C# but easily converts to C++ and the concepts are applicable to most languages. If there is enough demand, a C++ version of the test bed will be released. The solution loads and compiles under Xamarin Studio and runs under Mono.

The test system is a 3.4 GHz Core i7-2600K “Sandy Bridge” CPU with 16 GB of RAM. Turbo Boost was disabled to allow a more accurate estimate of the potential performance of the system. Hyper-Threading was also disabled. Only the operations required for multiplication were measured. Creating and filling test matrices with random data was not included in the performance figures. For methods that required a matrix transposition, the transposition time was included. All matrix elements are double-precision (64-bit) floating point values.

The performance figures compare a straightforward matrix-matrix multiplication routine to a transposition routine and a block multiplication routine. The latter two correspond to the first and third optimizations. All routines use accumulators.

I have not written a version of the second optimization for reasons beyond the scope of this article. The first two routines use an array of arrays to hold the matrix elements. The third uses an array of arrays of 1024 element one-dimensional arrays. This is the equivalent of a 32 x 32 block size.

Performance figures shown are a percentage of the test system’s theoretical maximum possible MFLOPS without SIMD. The theoretical maximum is approximately 3400 MFLOPS for single-core routines and 13600 MFLOPS for four-core. Both values were treated as exact for calculation purposes.

The MFLOPS of each operation was calculated using the average time from 100 runs for 1000 x 1000 matrices, 20 runs for 5000 x 5000, and 5 runs for 10000 x 10000 and 15000 x 15000. The number of floating point operations was calculated using the minimum possible for a square matrix of size N x N: 2 * N^3 – N^2. This slightly understates the actual number of operations performed, thus understating the actual MFLOPS.

Given the performance improvement, the complexity of the block algorithm compared to the transpose algorithm may not seem to justify the time investment. However, once larger matrices are multiplied, the benefits become apparent.

For comparison purposes, here is Math.Net‘s Intel MKL-based matrix-matrix multiplication vs the block algorithm. This chart shows MFLOPS, not percentage of maximum performance. Both algorithms use all four cores. The theoretical maximum of the test machine using SIMD on four cores is 109 GFLOPS.

]]>

The first three parts of this series walked through optimizing dot products and matrix-matrix multiplication for the cache. The general guidelines for cache optimization are partitioning, proximity, load speed, and load frequency.

The calculation must be divisible into subcalculations. Both the dot product and matrix-matrix multiplication calculations can be divided. Dot product can be done in groups and matrix-matrix multiplication can be done by multiplying submatrices.

The data must be divisible into blocks. The blocks must be sized such that as many as needed fit into cache simultaneously. Both examples were readily divisible. The dot products were divided into three element blocks. The matrices were divided into 3×3 submatrices. Two three element dot product blocks fit into cache at once, as did two 3×3 submatrices.

Partitioning the data into blocks isn’t enough. When partitioned, the elements necessary for the divided calculation must all lie within the same block. Both examples had good proximity. The three element dot product blocks contained all the elements needed for that part of the dot product. The 3×3 submatrices for matrix-matrix multiplication contained all of the elements necessary for their multiplication.

Moving the data from the blocks into cache must be fast. This can be done in two ways. The data can be stored in RAM in structures designed so the memory manager automatically loads a block into cache. This requires using specialized data structures and may cause problems when moving data from one system to another.

The alternative is to copy a block’s elements from RAM into a special data structure the memory manager automatically loads into cache. Writing a high speed data copy routine is not trivial. Many languages do not have the features necessary to write one. Writing the routine in one language and calling it from another involves overhead. This overhead may slow the call down more than writing it in another language sped it up.

The dot product example had great load speed. The elements were in one-dimensional arrays. The memory manager automatically loaded the elements into cache in three element blocks.

The matrix-matrix multiplication calculation, as written, has horrible load speed. While the code may think of the data as being in submatrices, the elements in RAM are still in a giant, row-major, one-dimensional array. The data structure must be modified to store the elements in actual submatrices or special code will be needed to copy the submatrix elements from RAM into cache. Please see Implementing a High Performance Matrix-Matrix Block Multiply for one way of storing the matrix elements in submatrices.

The final piece is ensuring the blocks *as a group* are loaded into cache as few times as possible. This requires looking at the big picture. Minimizing loads for a part of the calculation may prevent the calculation as a whole from being as fast as possible.

The dot product calculation automatically had the lowest possible load frequency. Each three element block was only needed once and was only loaded into cache once. Unfortunately, knowing how to optimize one part of a calculation may result in suboptimal performance for the calculation as a whole.

In the first matrix-matrix multiply, using the optimized dot product routine resulted in inefficient cache use. Each block of a row of A was needed multiple times and was loaded multiple times. The second optimization changed the order in which matrix-matrix multiplication was calculated, which resulted in each block of A being loaded only once. The third optimization also did this.

The only way to approach the maximum calculation potential of a CPU is by optimizing the calculation for the cache. The high clock speeds of modern CPUs have made RAM access prohibitively expensive. Even non-SIMD calculations written without cache optimization run more than 10x slower than a cache-optimized version.

Running an inefficient calculation on many CPUs is often considered a better solution than cache optimization. While this may raise throughput, latency remains an issue and increases the cost of mistakes. Unknowingly tainting a 10 minute calculation costs 10 minutes. Unknowingly tainting just one of ten 100 minute calculations costs 100 minutes.

SIMD optimization is predicated on cache optimization. The pinnacle of CPU performance is out of reach for those who do not optimize for the cache. The difference in performance between a SIMD-optimized calculation and one not optimized for the cache can be a factor of 1000. This is the difference between taking over eight hours and taking five minutes.

Following the guidelines above may result in massive speed improvements for a code. If a programmer plans on using SIMD, cache optimization is the first step. However, cache optimization has some perils.

Before doing any cache optimization, a comprehensive test system must be available. If a tested, known-good reference routine is not available, start with the simplest implementation of the calculation. Once that is well-tested, use it as a validator. Cache optimization often involves data structures that create fencepost and/or symmetry problems. Tests must cross over partition boundaries, use non-integer multiples of block sizes, and so forth.

Partitioning schemes must take into account the nature of the data. Does the data naturally partition itself into the chosen block size? Will it continue to do so? Does partitioning happen invisibly to the end-programmer? Or do they have to code for it? All of these questions potentially limit the portability and lifespan of the code.

Cache-optimized calculations will be mathematically identical to the original calculation. But they may not have an identical number of operations. When working with floating point values, this may result in small differences between the two calculations. Additionally, care must be taken to avoid significant digit problems. If the code must match a reference calculation exactly, cache optimizations may be limited or impossible.

Intentionally picking the order of operations in a calculation allows the best optimization the programmer is capable of envisioning. Unfortunately, this increases development time and limits the optimization to what the programmer can conceptualize. Whether this level of optimization is better or worse than what the compiler would produce depends on the compiler and the programmer. Additionally, some compilers may produce better output than a programmer only when “tricked”. Learning how to “trick” the compiler takes time and these tricks may change between minor releases of the compiler. Whether it is better to optimize the code by hand or play compiler games depends on the programmer, the compiler, and the situation.

Different CPUs use different cache sizes. As of 2013, almost all x86-compatible laptop, desktop, and server CPUs use 32 KB L1 data caches. Many tablet and phone CPUs use 16 KB data caches. Some older ones use 8 KB data caches. Whether this is an issue depends on the purpose of the code.

If data partitioning is done via a custom data structure, custom support routines will be needed. Using generic support routines via indexers or mapping objects may leave the system with one amazingly fast calculation and many more extremely slow ones. Writing and testing custom support routines and converters is non-trivial.

This series deals with optimizing for the L1 data cache. When writing code destined for SIMD optimization, this may not be enough. Optimizing beyond the L1 cache is beyond the scope of this article. To get an idea of what is involved, please follow the links from Kazushige Goto’s Wikipedia entry.

Please continue on to the performance comparisons.

]]>The second optimization improved the speed of the overall calculation. But, like the first optimization, it required transposing the second matrix. Transposition requires time and memory. However, dividing the rows into blocks was a good start.

In order to explain the third optimization, we need a cache more like one used in a real CPU. The cache now holds eighteen elements. The memory manager now pulls in nine elements at a time. The example matrices are the same. On paper:

and in RAM:

In the second optimization, each row was divided into two halves. Each half was used to calculate a partial dot product. As the calculation continued, the partial dot products were summed. Once all of the partial dot products were calculated and summed, C = A * B was complete.

Dividing the rows into halves divided, or *partitioned*, them into *blocks*. In this case, each block was one row high by three columns wide. While BT wasn’t explicitly partitioned by the code, the memory manager effectively partitioned BT by loading three elements of each row at a time.

A larger cache can hold more elements. The 18 element cache could be used to hold an entire row of A and BT. But that still requires transposing B to create BT. It would be better if B didn’t need to be transposed in order to use the cache effectively. The two matrices are once again A and B, both on paper:

and in RAM:

The second optimization divided each row into three element blocks. This was possible because a dot product can be calculated in parts. However, a row divided into blocks is also a matrix. A three element block is a 1 x 3 matrix.

This means A and B can be divided into sizes other than 1 x 3. The formula for multiplying a matrix that has been divided into blocks is the same as the normal matrix-matrix multiplication formula. The only difference is that the elements of A, B, and C are now matrices. Partitioning A and B into 3×3 submatrices:

turns them into a 3 x 2 matrix and a 2 x 2 matrix:

Since multiplying two matrices that are composed of submatrices follows the normal multiplication rules, A * B = C looks like:

Following the rules, C[0,0] = A[0,0] * B[0,0] + A[0,1] * B[1,0]. Matrix multiplication has already been defined, so calculating A[0,0] * B[0,0] and A[0,1] * B[1,0] is simple. Adding two matrices requires adding corresponding elements, which is also trivial.

One advantage of this is the submatrix size can be tailored to the cache size. Two 3 x 3 matrices will fit into an eighteen element cache. This makes multiplying two 3 x 3 matrices very fast, much like dividing rows into three element pieces made calculating dot products very fast.

Another advantage is the order of operations can be tailored to use the cache efficiently. Just like the second optimization’s dot products, the submatrix multiplications can be partially calculated. Instead of loading A[0,0] and B[0,0] into cache and multiplying them, then loading A[0,1] and B[0,1] and multiplying them, the code can load A[0,0] into cache and keep it there. It multiplies A[0,0] by B[0,0] and stores a temporary result. Then it multiplies A[0,0] by B[0,1] and stores a temporary result. Only then does it move on to A[0,1]. This minimizes how many times each submatrix is loaded into cache.

The final advantage is this simplifies writing a fast matrix-matrix multiplication routine. Instead of trying to come up with code that is fast for any size matrix, time can be spent coming up with an insanely fast routine for multiplying 3 x 3 matrices. Or 10 x 10. Or whatever size fits the cache best. Once this insanely fast routine is complete, any pair of matrices can be partitioned into blocks of this optimal size.

Unfortunately, getting the data in a block from RAM into cache is not trivial. If the data is stored in a standard row-major or column-major array, special code is needed to copy the data from the main matrix array in RAM to specialized submatrix arrays that can be loaded into cache. If this code is slow, block multiplication will be slower than transposing the second matrix. Fast array copy code often requires programming language features unavailable in most recent programming languages.

The other way of handling this is to store the data in submatrix-major order. This has its own drawbacks that are covered in the conclusion. For an implementation that works in most languages, please see Implementing a High Performance Matrix-Matrix Block Multiply.

Conclusion and concerns are in part four.

]]>The first optimization improved the dot product calculation. But the order in which the dot products were calculated resulted in each partial row of A being loaded into cache multiple times. The overall matrix multiplication calculation would run faster if each partial row of A was used as many times as possible before overwriting it in cache.

A and B are the same as in part one. The cache, CA, still holds six elements. The memory manager retrieves three elements from RAM at a time. As in the first optimization, the calculation has been modified to use transposed B, BT. On paper, the matrices are:

In RAM, they are:

When multiplying two matrices, C = A * B, each element of C, C[i, j], is the dot product of the i-th row of A and the j-th row of BT.

There is no rule that a dot product must be calculated all at once. The formula could break each dot product into sections. C[i, j] is the dot product of the first three element pairs of the i-th row of A and the j-th row of BT *plus* the dot product of the second three element pairs of the i-th row of A and the j-th row of BT.

In terms of reading data from RAM, the first optimization did this. Each read from RAM brought in three useful elements. Making the calculation *explicitly* work with three element pairs at a time lets the programmer work better with the cache. Working with three element pieces, or *blocks*, of each matrix row lets the programmer control, to an extent, what blocks are kept in cache and for how long.

If the code calculates the dot products of all the first three element pairs, then all of the second three element pairs, the number of RAM accesses is reduced. When C[0,0] is calculated, the first three elements of the first rows of A and BT are loaded into cache and their dot product is calculated. This is the same as in the first optimization:

Instead of continuing on to the next three elements of the first rows of A and BT, the code saves this intermediate dot product. The code then calculates the dot product of the first three elements of the first row of A and the first three elements of the *second* row of BT.

The cache is now:

This process is repeated using the first three elements of the remaining rows of BT. The code is now finished with the first three elements of the first row of A. They were loaded into cache once, used, and can be overwritten.

The code then calculates intermediate dot products for the second three elements of the first row of A and the second three elements of each row of BT. These intermediate dot products are added to the ones stored earlier. The result of this set of calculations is the entire first row of C.

This process is repeated for the remaining rows of A. Once these are complete, all of C has been filled in. This change in calculation order yields a large improvement in the number of RAM accesses.

For the example multiplication, A has nine rows and six columns. B has six rows and six columns. C has nine rows and six columns. The cache, CA, holds six elements and the memory manager loads three elements into cache at at time. Each of the methods requires the following number of RAM accesses.

Method RAM Accesses Basic 8 * 54 = 432 First Opt. 4 * 54 = 216 Second Opt. 14 * 9 = 126

The first and second optimizations require transposing B. This takes time and uses memory. There is a better way of optimizing matrix multiplication for the cache.

To be continued in part three.

]]>If all of the data required for a calculation fits into cache simultaneously, no cache-specific optimization is necessary. When the data required for a calculation is too large to fit into the cache, it must be processed in blocks. These blocks should contain all necessary data to complete a subset of the calculation. If possible, they should also contain the data necessary to complete the next subset of the calculation. Failing to do this efficiently will result in unnecessary RAM accesses and longer execution times.

Modern CPUs have multiple cores and multiple caches. Commonly, each core has its own cache (the L1 cache) and shares one or more other caches (L2, L3, perhaps others) with the remaining cores. This article uses vector dot product and matrix-matrix multiplication to demonstrate optimizing for the L1 cache. Depending on the data characteristics and potential processing speed of a particular calculation, optimizing for other caches may be useful.

Some calculations may be inherently cache optimized. Vector dot product is one. Take two nine element vectors, X and Y, stored as one-dimensional arrays. The CPU has a six element cache, CA. To compute X • Y, corresponding elements of X and Y are multiplied together and the resulting products are summed. System memory initially looks like:

The dot product code starts with the first elements of X and Y, X[0] and Y[0]. When the code requests these from RAM, the memory manager fetches X[0] and the values stored in the next several sequential memory locations. These values are copied into cache. It then performs the same process for Y[0].

Once the data has been retrieved, the code stores 1 * 11 in a temporary variable. (Writing to cache will be covered in a separate article.) If the memory manager fetches three elements at a time, system memory now looks like:

The code moves on to the second pair of elements. When it requests X[1] and Y[1], the memory manager retrieves each from cache. This replaces two slow RAM accesses with two fast cache accesses. The third element pair, X[2] and Y[2], is also retrieved from cache.

When the code requests the fourth element pair, the memory manager has to read them from RAM. It performs one read for X’s elements 4, 5, and 6 and another read for Y’s elements 4, 5, and 6. These six values are copied into the cache. System memory now looks like:

Elements pairs 4, 5, and 6 can be multiplied without accessing RAM. Element pairs 7, 8, and 9 require two more RAM accesses. This entire dot product calculation requires six reads from RAM: two reads each for the first, fourth, and seventh element pairs. These six reads load the remaining element pairs as a side-effect.

This calculation is inherently cache-optimized. The vectors are stored as one-dimensional arrays. The array elements are sequential in RAM. The array elements are accessed in sequence. The memory manager’s default actions are optimal for this calculation.

A programmer could short-circuit this inherent optimization. If X • Y were computed as X[0] * Y[0] + X[3] * Y[3] + X[6] * Y[6] and so forth, the calculation would take much longer than necessary. Every element pair would require two reads from RAM. The calculation would go from six RAM reads to eighteen.

Unfortunately, many data structures do not work well with how the cache is filled.

While programmers and researchers often use n-dimensional data structures, RAM is a one-dimensional array. Mapping a two-, three-, or more-dimensional array into one-dimensional RAM often leads to inefficient cache use. Matrix-matrix multiplication is a perfect example. If needed, please see A Quick Introduction to Matrix-Matrix Multiplication for a brief refresher.

Matrices are generally thought of as two-dimensional arrays. But what appears on paper as:

looks like this in RAM:

The first element of A * B is the dot product of A’s first row and B’s first column. On paper, this is simple.

In RAM, it is not.

When the code begins calculating the dot product, it requests A[0,0] and B[0,0]. Using the same six element cache and memory manager that loads three sequential elements at a time, the cache looks like this:

The first three elements of A’s first row are now in cache. The first three elements of B’s first row are also in cache. Unfortunately, we need the first three elements of B’s first *column*. Only the first element of B’s first row is useful.

Since one row of A and one row of B cannot fit into cache at the same time, the best scenario for a single dot product calculation is four reads from RAM. Two reads per row of A and two reads per column of B. But, as shown, the data structure doesn’t allow this.

With this data structure, each dot product requires eight RAM accesses: one for every three elements of a row of A plus one for every element of a column of B. A * B requires 54 dot products. The multiplication calculation went from 216 RAM accesses to 432 RAM accesses. A modern CPU core can do hundreds if not thousands of operations in the time one RAM access takes.

If B’s columns were in rows, they would be in sequential memory. This would result in better cache use because the memory manager would automatically load three useful elements of B into cache per RAM access. Transposing a matrix swaps the rows and columns. BT represents transposed B. On paper, the new matrices look like:

They look like this in RAM:

This change turns the cache for the first portion of the first dot product into:

This reduces the RAM accesses per dot product to four. Since this is the calculated minimum number of accesses per dot product, the data now seems optimized for the cache.

But it isn’t. Cache usage for calculating a single dot product is optimized. But the calculation multiplies two matrices. The second element of A * B is the dot product of the first row of A and the second column of B. Using BT, the calculation is the dot product of the first row of A and the second row of BT.

This will take an additional four RAM accesses. Two of these will be spent loading the first row of A into the cache *for the second time*. Optimizing the low-level calculation, in this case dot product, improved performance. However, optimizing the calculation as a whole will cut the number of RAM accesses almost in half for this example.

To be continued in part two.

]]>Imagine we have two matrices, A and B. A has 5 rows and 3 columns. B has 3 rows and 4 columns. Multiplying A * B results in C, which has 5 rows and 4 columns.

*The numbers down the left of each matrix are the row numbers. **The ones across the top are the column numbers.*

We refer to an element in a matrix by its row and column position: Matrix Name[row, column]. A[1,1] is 1. B[2,3] is 6. C[5,4] is… well, we don’t know what C[5,4] is.

*The shaded cells are the elements referred to.*

To calculate C[1,1], we need the first row of A and the first column of B.

We need to take the dot product of the first row of A and the first column of B. The dot product is represented by a dot: x • y. Dot product means multiplying each element of A’s first row by the corresponding element of B’s first column, then adding up the result of the multiplications. Put another way:

`A's Row 1 • B's Column 1 = A[1,1] * B[1,1] + A[1,2] * B[2,1] + A[1,3] * B[3,1] `

It’s easier to calculate if we write it down. First, we’ll take just the row and column we need.

*The numbers down the left are the row numbers.The ones across the top are the column numbers.*

Next, we’ll write B’s first column underneath A’s first row, lining up A’s columns with B’s rows.

*The numbers across the top represent column numbers for A and row numbers for B.*

Then we’ll multiply the numbers in each column and write each result underneath. 1 * 12 is 12, 2 * 8 is 16, and 3 * 4 is 12.Finally, we add them together. 12 + 16 + 12 = 40

And put the result in C[1,1].

The remaining elements of C are calculated in a similar manner. C[1,2] is the dot product of the first row of A and the second column of B. C[1, 3] is the dot product of the first row of A and the third column of B.

Even though A only has three columns, we’re taking the dot product of rows of A and columns of B. Since B has another column, we take the dot product of the first row of A and the fourth column of B to calculate C[1, 4]. If B had additional columns, we would continue calculating dot products using the first row of A until we ran out of columns.

The remaining rows of A follow the same pattern. C[2,1] is the dot product of the second row of A and the first column of B. C[3,2] is the dot product of the third row of A and the second row of B. And so forth.

The general rule is C[i, j] equals the dot product of the i-th row of A and the j-th column of B. We’ll do one more example. C[5, 3] is the dot product of the fifth row of A and the third column of B.

The dot product of A’s row 5 and B’s row 3 is 244.

So C[5, 3] is 244. Let’s fill in the rest of C.

Taking what we learned from the above example:

If A is a matrix with M rows and P columns and B is a matrix with P rows and N columns, A * B results in a matrix C. C has M rows and N columns. Each element in C, C[i, j], is the dot product of the i-th row of A and the j-th column of B. Two matrices can be multiplied if and only if A has the same number of columns that B has rows.

]]>