A cache-optimal algorithm is one that minimizes the number of cache misses.
By way of example, first consider the straightforward matrix multiplication algorithm to multiply n x n matrices A and B to obtain C:
OrdMult(A, B, C, n):
    for i = 0 to n - 1
        for j = 0 to n - 1
            for k = 0 to n - 1
                C_ij = C_ij + A_ik * B_kj;
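The pseudocode can be sketched as a runnable Python function (a minimal sketch, assuming matrices are stored as lists of lists and that C starts zeroed, since each C_ij accumulates over k):

```python
def ord_mult(A, B, C, n):
    """Straightforward Theta(n^3) matrix multiplication: C += A * B."""
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]

# Usage: multiply two 2x2 matrices.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
ord_mult(A, B, C, 2)
# C is now [[19, 22], [43, 50]]
```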
If we assume the (Z, L) ideal-cache model, we get the following cache-complexity analysis:
Assume n is large enough that B (of size n^2) does not fit in cache, i.e. n^2 > Z. Then for each of the n rows of A we must read all of B in again, incurring Θ(n^2/L) cache misses per row of A. Thus, the cache complexity of OrdMult is Θ(n^3/L). OrdMult is not cache-optimal.
A cache-optimal algorithm for matrix multiplication is as follows:
First, view the matrices A and B as consisting of n/s x n/s sub-matrices, each of size s x s. We choose s so that three s x s matrices fit into cache at the same time, i.e. s = Θ(sqrt(Z)). (Since the choice of s depends on Z, the algorithm is cache-aware, not cache-oblivious.) We then multiply these sub-matrices in blocks:
BlockMult(A, B, C, n, s):
    for i = 0 to n/s - 1
        for j = 0 to n/s - 1
            for k = 0 to n/s - 1
                OrdMult(A_ik, B_kj, C_ij, s);
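A runnable sketch of the blocked algorithm follows (assuming s divides n; rather than slicing out sub-matrix copies, this sketch inlines the per-block OrdMult as index ranges, which is how blocking is usually written in practice):

```python
def block_mult(A, B, C, n, s):
    """Blocked matrix multiplication: work on s x s sub-matrices so that
    three blocks fit in cache at once. Assumes s divides n and C starts zeroed."""
    for i0 in range(0, n, s):          # block row of C
        for j0 in range(0, n, s):      # block column of C
            for k0 in range(0, n, s):  # blocks along the shared dimension
                # Per-block OrdMult on A_ik, B_kj, C_ij.
                for i in range(i0, i0 + s):
                    for j in range(j0, j0 + s):
                        for k in range(k0, k0 + s):
                            C[i][j] += A[i][k] * B[k][j]

# Usage: a 4x4 example with 2x2 blocks.
n, s = 4, 2
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[(i + j) % 5 for j in range(n)] for i in range(n)]
C = [[0] * n for _ in range(n)]
block_mult(A, B, C, n, s)
```

Since every (i, j, k) triple is visited exactly once, the result is identical to the straightforward algorithm; only the order of the memory accesses changes.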
Since each OrdMult call reads three s x s matrices into cache, it incurs Θ(s^2/L) cache misses. Thus, the overall cache complexity is Θ((n/s)^3 * s^2/L) = Θ(n^3/(sL)) = Θ(n^3/(sqrt(Z)L)).
BlockMult is cache-optimal.
In practice, the running time of OrdMult grows noticeably faster than n^3 (closer to n^4 on some machines), and this is because of cache misses. BlockMult runs much closer to the theoretical Θ(n^3) bound.
Exercise for the reader:
Divide-and-conquer matrix multiplication is a simple algorithm in which we divide each matrix into four equally-sized sub-matrices:
C_11 = A_11 * B_11 + A_12 * B_21
C_12 = A_11 * B_12 + A_12 * B_22
...
Prove that divide-and-conquer matrix multiplication, a cache-oblivious algorithm, is cache-optimal, i.e. that its cache complexity is Θ(n^3/(sqrt(Z)L)).
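As a starting point for the exercise, the divide-and-conquer scheme can be sketched in Python (a sketch under the assumption that n is a power of two; the recursion here bottoms out at 1 x 1 blocks, though a practical implementation would switch to a fixed-size base case):

```python
def dc_mult(A, B):
    """Divide-and-conquer multiplication of n x n matrices, n a power of two."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2

    def quad(M, r, c):
        # The h x h sub-matrix of M with top-left corner at (r, c).
        return [row[c:c + h] for row in M[r:r + h]]

    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)

    # C_11 = A_11 * B_11 + A_12 * B_21, and similarly for the other quadrants.
    C11 = add(dc_mult(A11, B11), dc_mult(A12, B21))
    C12 = add(dc_mult(A11, B12), dc_mult(A12, B22))
    C21 = add(dc_mult(A21, B11), dc_mult(A22, B21))
    C22 = add(dc_mult(A21, B12), dc_mult(A22, B22))

    # Stitch the four quadrants back together.
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot

# Usage: a 2x2 check.
assert dc_mult([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```

Note that no block size appears anywhere in the code: the recursion eventually produces sub-problems that fit in cache for any Z, which is exactly what makes the algorithm cache-oblivious.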