This is part two in a series on basic cache optimization. Please start with part 1.

The first optimization improved the dot product calculation. But the order in which the dot products were calculated resulted in each partial row of A being loaded into cache multiple times. The overall matrix multiplication calculation would run faster if each partial row of A was used as many times as possible before overwriting it in cache.

### Second Optimization

A and B are the same as in part one. The cache, CA, still holds six elements. The memory manager retrieves three elements from RAM at a time. As in the first optimization, the calculation has been modified to use transposed B, BT. On paper, the matrices are:

In RAM, they are:

When multiplying two matrices, C = A * B, each element of C, C[i, j], is the dot product of the i-th row of A and the j-th row of BT.

There is no rule that a dot product must be calculated all at once. The formula could break each dot product into sections. C[i, j] is the dot product of the first three element pairs of the i-th row of A and the j-th row of BT *plus* the dot product of the second three element pairs of the i-th row of A and the j-th row of BT.

In terms of reading data from RAM, the first optimization did this. Each read from RAM brought in three useful elements. Making the calculation *explicitly* work with three element pairs at a time lets the programmer work better with the cache. Working with three element pieces, or *blocks*, of each matrix row lets the programmer control, to an extent, what blocks are kept in cache and for how long.

If the code calculates the dot products of all the first three element pairs, then all of the second three element pairs, the number of RAM accesses is reduced. When C[0,0] is calculated, the first three elements of the first rows of A and BT are loaded into cache and their dot product is calculated. This is the same as in the first optimization:

Instead of continuing on to the next three elements of the first rows of A and BT, the code saves this intermediate dot product. The code then calculates the dot product of the first three elements of the first row of A and the first three elements of the *second* row of BT.

The cache is now:

This process is repeated using the first three elements of the remaining rows of BT. The code is now finished with the first three elements of the first row of A. They were loaded into cache once, used, and can be overwritten.

The code then calculates intermediate dot products for the second three elements of the first row of A and the second three elements of each row of BT. These intermediate dot products are added to the ones stored earlier. The result of this set of calculations is the entire first row of C.

This process is repeated for the remaining rows of A. Once these are complete, all of C has been filled in. This change in calculation order yields a large improvement in the number of RAM accesses.

For the example multiplication, A has nine rows and six columns. B has six rows and six columns. C has nine rows and six columns. The cache, CA, holds six elements and the memory manager loads three elements into cache at at time. Each of the methods requires the following number of RAM accesses.

Method RAM Accesses Basic 8 * 54 = 432 First Opt. 4 * 54 = 216 Second Opt. 14 * 9 = 126

The first and second optimizations require transposing B. This takes time and uses memory. There is a better way of optimizing matrix multiplication for the cache.

To be continued in part three.