A Quick Introduction to Matrix-Matrix Multiplication

Multiplying two matrices together is most easily explained with an example. Once we work through the example, we’ll express what we did in words.

Imagine we have two matrices, A and B. A has 5 rows and 3 columns. B has 3 rows and 4 columns. Multiplying A * B results in C, which has 5 rows and 4 columns.

The numbers down the left of each matrix are the row numbers. The ones across the top are the column numbers.

We refer to an element in a matrix by its row and column position: Matrix Name[row, column]. A[1,1] is 1. B[2,3] is 6. C[5,4] is… well, we don’t know what C[5,4] is.

The shaded cells are the elements referred to.

To calculate C[1,1], we need the first row of A and the first column of B.

We need to take the dot product of the first row of A and the first column of B. The dot product is represented by a dot: x • y. Taking the dot product means multiplying each element of A’s first row by the corresponding element of B’s first column, then adding up the results. Put another way:

A's Row 1 • B's Column 1 = A[1,1] * B[1,1] + A[1,2] * B[2,1] + A[1,3] * B[3,1] 

It’s easier to calculate if we write it down. First, we’ll take just the row and column we need.

The numbers down the left are the row numbers. The ones across the top are the column numbers.

Next, we’ll write B’s first column underneath A’s first row, lining up A’s columns with B’s rows.

The numbers across the top represent column numbers for A and row numbers for B.

Then we’ll multiply the numbers in each column and write each result underneath. 1 * 12 is 12, 2 * 8 is 16, and 3 * 4 is 12. Finally, we add them together. 12 + 16 + 12 = 40.

We put the result, 40, in C[1,1].
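The whole calculation fits in a few lines of code. Here is a minimal C sketch using the row and column values from the example:

#include <stdio.h>

int main(void) {
    double aRow1[3]    = { 1.0, 2.0, 3.0 };  // A's first row
    double bColumn1[3] = { 12.0, 8.0, 4.0 }; // B's first column

    // Dot product: multiply corresponding elements, then sum.
    double dot = 0.0;
    for (int i = 0; i < 3; ++i) {
        dot += aRow1[i] * bColumn1[i];
    }

    printf("%g\n", dot); // Prints 40
    return 0;
}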

The remaining elements of C are calculated in a similar manner. C[1,2] is the dot product of the first row of A and the second column of B. C[1, 3] is the dot product of the first row of A and the third column of B.

Even though A has only three columns, we’re taking the dot product of rows of A with columns of B, and B has a fourth column. So we take the dot product of the first row of A and the fourth column of B to calculate C[1, 4]. If B had additional columns, we would continue calculating dot products using the first row of A until we ran out of columns.

The remaining rows of A follow the same pattern. C[2,1] is the dot product of the second row of A and the first column of B. C[3,2] is the dot product of the third row of A and the second column of B. And so forth.

The general rule is C[i, j] equals the dot product of the i-th row of A and the j-th column of B. We’ll do one more example. C[5, 3] is the dot product of the fifth row of A and the third column of B.

The dot product of A’s row 5 and B’s column 3 is 244.

So C[5, 3] is 244. The rest of C is filled in the same way.

Matrix Multiplication Formula

Taking what we learned from the above example:

If A is a matrix with M rows and P columns, and B is a matrix with P rows and N columns, then A * B results in a matrix C with M rows and N columns. Each element of C, C[i, j], is the dot product of the i-th row of A and the j-th column of B. Two matrices can be multiplied if and only if A has the same number of columns as B has rows.
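As a reference point for the optimized version below, here is a direct, unoptimized translation of that rule into C (a sketch assuming dense, row-major arrays; the function name is just for illustration):

// C = A * B, where A is M x P, B is P x N, and C is M x N.
// All matrices are dense, row-major arrays of doubles.
void matmul_naive(const double *A, const double *B, double *C,
                  int M, int P, int N) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            // C[i, j] is the dot product of A's row i and B's column j.
            double dot = 0.0;
            for (int k = 0; k < P; ++k) {
                dot += A[i * P + k] * B[k * N + j];
            }
            C[i * N + j] = dot;
        }
    }
}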

Implementing a High Performance Matrix-Matrix Block Multiply

A basic block matrix-matrix multiply code divides the operands into logical sub-matrices, then multiplies and adds those sub-matrices to calculate the product. For A * B = C and a block size of S:

for (int blockRow = 0; blockRow < A.Rows; blockRow += S) {
    for (int blockColumn = 0; blockColumn < B.Columns; blockColumn += S) {
        for (int blockK = 0; blockK < A.Columns; blockK += S) {
            // Multiply one pair of logical sub-matrices, clamping each
            // loop at the matrix edges, and accumulate into C.
            for (int row = blockRow; row < MIN(blockRow + S, A.Rows); ++row) {
                for (int column = blockColumn; column < MIN(blockColumn + S, B.Columns); ++column) {
                    for (int k = blockK; k < MIN(blockK + S, A.Columns); ++k) {
                        C[row, column] += A[row, k] * B[k, column];
                    }
                }
            }
        }
    }
}

While this code is easy to understand, it relies on the compiler to treat the matrices as blocks. Many if not most compilers do this poorly. Even with significant optimization, this code cannot overcome the limitations of most commonly used matrix data structures.

In order to implement a high performance matrix-matrix block multiply, the data structure and implementation code must be carefully optimized. In this article, the block size is S, the matrices are A * B = C, and the matrix elements are double-precision (64-bit) floating point numbers.

The Data Structure

Instead of relying on the compiler to optimize the basic block multiply code, the matrix data will be stored directly in blocks. While a matrix M has X logical rows and Y logical columns of double-precision elements, the actual structure will be an array of arrays with CEILING(X / S) rows and CEILING(Y / S) columns. Each physical element of M is a one-dimensional array of doubles with S * S elements; edge blocks are padded with zeros so every block is a full S x S array. While building the array of arrays out of an existing matrix class may be tempting, overall performance is unlikely to be adequate.
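One possible shape for this structure, sketched in C (the type and field names are illustrative, and allocation and bounds checking are omitted):

#define S 32 // block size; a placeholder, see "Determining Block Size" below

// A matrix stored as a grid of S x S blocks. Each block is one
// contiguous array of S * S doubles; edge blocks are zero-padded.
typedef struct {
    int rows, columns;           // logical dimensions
    int blockRows, blockColumns; // CEILING(rows / S), CEILING(columns / S)
    double **blocks;             // blockRows * blockColumns block pointers
} BlockMatrix;

// Read logical element [row, column] through the blocked layout.
double get(const BlockMatrix *m, int row, int column) {
    double *block = m->blocks[(row / S) * m->blockColumns + (column / S)];
    return block[(row % S) * S + (column % S)];
}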

This design will result in a small amount of wasted memory. The worst case, where both the number of rows and the number of columns are one more than an integer multiple of S, wastes approximately X * (S - 1) * sizeof(double) + Y * (S - 1) * sizeof(double) bytes. For a 100,000 x 100,000 matrix with S = 32, this adds ~47 MB to the matrix’s ~75 GB size.
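A quick check of those figures (S = 32 is an assumption here, but it is the value consistent with the ~47 MB number):

#include <stdio.h>

int main(void) {
    // Worst-case padding waste for an X x Y matrix with block size S.
    long long X = 100000, Y = 100000, S = 32;
    long long waste = (X + Y) * (S - 1) * sizeof(double); // 49,600,000 bytes, ~47 MB
    long long data  = X * Y * sizeof(double);             // 80,000,000,000 bytes, ~75 GB
    printf("waste: %lld bytes of %lld\n", waste, data);
    return 0;
}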

The Multiplication Algorithm

Since a matrix of matrices is still a matrix, the standard matrix-matrix multiplication algorithm holds. Each element C[i, j] is the dot product of the i-th row of A and the j-th column of B. The dot product of two vectors of matrices is the sum of the products of the corresponding vector elements, where each “product” is itself a small matrix-matrix multiply and each “addition” is an element-wise matrix add.

The Implementation Code

// C99/C++-style code. If using C#, substitute double[] for double*

for (int blockRow = 0; blockRow < A.blockRowCount; ++blockRow) {
    for (int blockColumn = 0; blockColumn < B.blockColumnCount; ++blockColumn) {
        double* subC = C[blockRow, blockColumn];
        for (int blockK = 0; blockK < A.blockColumnCount; ++blockK) {
            double* subA = A[blockRow, blockK];
            double* subB = B[blockK, blockColumn];
            // Multiply the S x S blocks subA and subB, accumulating into subC.
            for (int row = 0; row < S; ++row) {
                for (int column = 0; column < S; ++column) {
                    for (int k = 0; k < S; ++k) {
                        subC[row * S + column] += subA[row * S + k] * subB[k * S + column];
                    }
                }
            }
        }
    }
}

Average single-threaded performance should be near 98% of theoretical, non-SIMD maximum for matrices over 2000 x 2000. Average multi-threaded performance should be over 95% for matrices over 3000 x 3000. Performance for square matrices should increase as matrix size increases.

If these performance levels are not reached via compiler options, some hand-optimization may be necessary. Once the optimal block size is known, unrolling the k loop will provide a significant speed improvement, as sketched below. Changing the order of the blockRow, blockColumn, and blockK loops may improve performance by a few percentage points.
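For example, the innermost k loop above could be unrolled by hand along these lines (a sketch that assumes S is known at compile time and is a multiple of 4):

// Replaces the k loop in the implementation above.
double sum = subC[row * S + column];
for (int k = 0; k < S; k += 4) {
    sum += subA[row * S + k]     * subB[k * S + column]
         + subA[row * S + k + 1] * subB[(k + 1) * S + column]
         + subA[row * S + k + 2] * subB[(k + 2) * S + column]
         + subA[row * S + k + 3] * subB[(k + 3) * S + column];
}
subC[row * S + column] = sum;

Accumulating into a local variable also keeps the running sum in a register rather than in memory until the loop finishes.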

Determining Block Size

Experimentation will determine the optimal block size. Sqrt(L1 Cache Size / (4 * sizeof(double))) is a good place to start.
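For example, assuming a 32 KB L1 data cache (a common size, but check the target CPU’s actual one):

#include <math.h>
#include <stdio.h>

int main(void) {
    // Starting guess: Sqrt(L1 Cache Size / (4 * sizeof(double))).
    // 32 KB is an assumed L1 data cache size.
    double l1Bytes = 32 * 1024;
    int blockSize = (int)sqrt(l1Bytes / (4 * sizeof(double)));
    printf("starting block size: %d\n", blockSize); // 32
    return 0;
}

From there, benchmark block sizes above and below the starting guess.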

Parallelization Note

When parallelizing the code, special attention must be paid to the loop design: it must guarantee that each result sub-matrix subC is updated by a single thread. If the design itself cannot guarantee this, explicit locking must be used. Otherwise, random errors may result, possibly affecting only a tiny number of elements out of millions or billions. Testing must therefore include comparing every element of the result against the result of a known-good matrix-matrix multiplication routine.
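One loop design that satisfies this on its own is to parallelize only the outermost loop, so each thread owns whole block rows of C. An OpenMP sketch (C# code could use Parallel.For over blockRow instead):

// Each blockRow iteration writes a disjoint set of C's blocks,
// so no two threads ever update the same subC.
#pragma omp parallel for
for (int blockRow = 0; blockRow < A.blockRowCount; ++blockRow) {
    for (int blockColumn = 0; blockColumn < B.blockColumnCount; ++blockColumn) {
        double* subC = C[blockRow, blockColumn];
        // ... inner blockK / row / column / k loops exactly as above ...
    }
}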

Final Notes

This combination of data structure, algorithm, and implementation code provides a high-performance, non-SIMD matrix-matrix multiplication routine. Maintaining 90%+ performance using SIMD instructions may require optimizations that go beyond what is possible in straight C/C++/C#.