Background
At the heart of most modern computers lies the central processing unit or
CPU. It is quite often the fastest and most expensive part of the machine.
The CPU operates on data and programs which reside in the computer's main
memory chips. The CPU fetches programs from the main memory one
instruction at a time and executes them. Quite often, the CPU will have to
fetch additional information from the memory, such as data on which a
calculation is to be performed. The CPU runs according to a clock. This clock
sends out pulses. At each pulse, the CPU will go one step further in this
fetch decode execute cycle.
From this very simple sketch, it should be clear that the communication
with main memory needs to be very fast: it would be unacceptable if it took a
CPU only one clock cycle to execute an instruction, but ten clockcycles to
fetch the instruction from main memory. It would be possible to construct
memory in such a manner that its access time is equal to that of the CPU's
internal registers. This way, a fetch would take only one clock
cycle. Unfortunately, such memory is very expensive. For reasons beyond the scope of this text, it would in addition be very complicated, if not impossible, to build a large memory like this. That is why normal
computers use slow, cheap dynamic RAM for memory. This RAM is in fact
about a factor ten slower than the CPU and it is the biggest bottleneck in
modern computers. When buying a computer, it would thus be wise to check out
its bus speed instead of the CPU speed.
Principle
To soften the problem outlined above, processors are equipped with a very
small amount of expensive on-chip memory, called a cache. The word cache
(pronounced 'cash') derives from the French word 'cacher' (to conceal). The
cache effectively hides the slower main memory from the CPU. The cache keeps
copies of small pieces of data that reside in the main memory. When the CPU
requests the data in a certain memory location, the hardware first checks if
a copy resides in the cache. If so, the cached data is used. If not, the
hardware loads a region of memory into a slot in the cache.
Why cache improves speed
Computer scientists have observed long ago that the memory access
behaviour of computer programs is far from random. The CPU does not request
values from all over the memory. Instead, there are always some hot spots,
some parts of memory that the CPU just keeps accessing repeatedly. This is
not surprising since computers are programmed by people that like to organise
data and computations into lists, variables, procedures, loops,
etc. A program usually executes the same procedures repeatedly, accesses
the same variables repeatedly, works its way through lists and pieces of
code, etc. This phenomenon is known as locality of reference. In a
sentence: locality of reference is the observed statistical property that a
program is most likely to request data equal to or close to previously
requested data.
With the property of locality of reference in mind, it makes sense to
cache an entire region of memory when a location from that region is
requested. With luck, the processor will access many locations in this region
in the future. Since the cache memory is very fast, this strategy can
increase the effective throughput of the CPU considerably.
Caching policy issues
Write Through versus Write Back
Obviously, when the CPU changes
data in a cached region, it needs to be written back to main memory at some
point in time. There are two policies that can be followed here. They are
write back and
write through. With write back, a region is marked as
'dirty' once its data have been altered. When the cache is full and regions
need to be evicted, the region is written back to memory. If it was not
dirty, it is just discarded, since the main memory holds the same copy.With
write through, the region is written to memory as soon as it is altered. The
former policy holds the best performance increase, since a region may be
written many times over for the cost of only one complete memory transfer,
whereas the latter costs one complete region transfer per write.
In some
situations, like multiprocessor environments or situations where
consistency and
robustness are issues, it might be necessary to write
through.
Region replacement
When the cache memory fills up, regions need to be evicted to make room
for new ones. The question arises which strategy should be used in order to
make the best use of the cache memory. The problem is essentially the same as
that of page replacement in a virtual memory system. Belady stated an
algorithm in 1966 which Mattson proved to be optimal in 1970. That
is, Belady's algorithm causes the least amount of cache misses of all conceivable algorithms. (A cache miss is when a piece of requested data is not in the cache and has to be fetched from memory. The opposite is called a cache hit.) The
algorithm is an obscenely simple one: "evict the region that will not be used
for the longest period of time". Unfortunately, this algorithm is not
causal. It requires knowledge of future events, which makes it impossible to
implement (at the time of writing). Many algorithms exist that approach the
optimal algorithm, like the Least Recently Used (LRU) algorithm, which
evicts the least recently used region, or the First In First Out (FIFO)
algorithm and many others. Preferably, hardware designers would use LRU,
because it is statistically the closest approximation of the optimal
algorithm. Unfortunately, implementing LRU is rather expensive and a
compromise is often chosen. To see why, read on!
Mapping
When a region is brought into the cache, there are two conflicting factors
involved when choosing a place to store the region. Those factors are
efficiency and money. First of all, we could have a fully associative
mapping. This means that the region can be placed anywhere in the cache.
This would permit an implementation of the LRU algorithm. However, such a
cache would make it somewhat difficult to find a specific piece of data
within the cache. Since any requested address can be anywhere in the cache,
the hardware required to translate linear addresses requested by the CPU to
cache locations needs to be fairly sophisticated, read: expensive.
Another strategy is to use direct mapping. In this scheme, each memory region
can be in at most one location in the cache. Another way to look at this is that certain memory regions share the same slot in the cache. So if a certain region is in the
cache, we always know exaclty where it is, since it has only one place to go. This makes for simple hardware, but it is
extremely inefficient. When two regions share the same cache slot they have to be continually swapped in and out if the CPU requires
both at the same time. This resembles the thrashing phenomenon found in
virtual memory systems that employ a local replacement policy, which was
described by Denning in 1968. It is especially unnecessary if the rest of
the cache is fully available.
A compromise can be reached in the form of a set-associative mapping. In
this scheme, each region has a limited amount (a 'set') of cache regions in
which it can reside. The hardware thus doesn't have to look through the
entire cache, because it knows there are just a few locations where the
sought after region can reside. The first few bits of the absolute memory
address are often used as a means to group regions. They are called the tag
bits. This way, each chunk of memory maps onto its own set of cache regions,
but within that region the mapping is fully associative.
Remark that both direct mapping and set-associative mapping inhibit the
implementation of a full LRU replacement strategy. They are cheaper to
implement however. An interesting way to look at this is that a direct
mapping allows no policy at all. The replacement algorithm is trivial since
each requested region has only one way to go. Thus, direct mapping requires
no special policy hardware. It also implies one of the worst possible
policies.
Region size
Another important design consideration for cache memories is the size of the cached regions. Again, many factors conflict. It is desirable to have a large region size because it increases the chances of another hit on that region. Large regions also make the controlling hardware considerably easier and cheaper. On the other hand, since cache space is limited, a small cache would be more desirable. Further, transferring regions to and from memory takes time; the longer the region, the more time it takes. In practice, finding a good region size is very much an empirical matter. The size of the cache itself is an economical one.
Leftovers
Multiple level caching
Effectively, a computer consists of layer upon layer of caches: the memory caches disk blocks, the on-chip cache caches memory, the registers cache the on-chip cache. There is really no limitation to add yet another layer of cache somewhere. For instance, between the on-chip cache and the main memory we could have another layer of memory that is faster and more expensive than main memory, but cheaper and slower than the on-chip cache. This too comes at an obvious price, but this kind of level 2 caching is quite common. So is level 3 caching. There are some natural limits however. At some point, building a large, fast memory will be cheaper than building dozens of layers of cache.
Translation Look-aside Buffers
In a virtual memory system, which will not be described in this node, a
special kind of cache is used in conjunction with the normal cache described
above. In a virtual memory system it is necessary to translate virtual
addresses to physical ones. It is also necessary to check if a physical
address is actually in memory or if it is residing on the hard disk.
Because this translation takes time (clock cycles) it pays off to keep the
information on recently accessed memory pages in a special cache: the
Translation Look-aside Buffer or TLB. On exactly the same grounds of
locality of reference, the CPU is likely to need access to the same page in
the future as it has just accessed. This is why the page descriptor of such
a page is kept in the TLB.
Some typical figures
If you have carefully read the above, you will have noticed that there is
much talk about probability and little about certainty. This needs to be
stressed: in most of the cases, caching increases speed and in most of the
cases, LRU works best. However, these are not laws of nature! There are
programs for which LRU performs very poorly, especially when traversing long
lists, and there are situations in which caching is slower than direct memory
access. These situations do occur in the field so to speak. Be warned. That
aside: your average application will have just about 90 percent of cache hits
for every memory reference on your average computer. Pretty good! The same
holds for the memory by the way: every memory reference that the cache makes
in a virtual memory system usually has about 90 percent of hits. The chance
of going to the disk is thus very dim. In a properly equipped system that is.
This is just to give an impression of course. Your mileage will definitely
vary.
Remarks
Caching, and the problems involved, is very similar to the mechanics of
paging in a virtual memory system and to the mechanics of block access on
disks. In fact, the main memory itself is used for caching disk blocks by
the disk driver, since disks are much slower than main memory. Each has its
own problems and peculiarities however, but generally, this type of managing
limitations in speed and capacity is what makes designing hardware and
operating system software so complex, and challenging perhaps. Problems like
these are found in operating systems on many, many levels. The manner in
which the implementer deals with them makes up for a great deal of the speed
and efficiency with which your machine operates. It also makes it very
difficult do compare, say a Macintosh with a G4 processor running MacOS X
to a PC with an Athlon running Linux.
Another thing to note about the LRU replacement algorithm described above, is that it requires the hardware to keep statistics on each region. It requires continuous knowledge of when each region was used. This causes considerable overhead. Many times a simple approximation is used that will only track the usage of each region for one to four cache misses. Still, updating all the statistics requires complex hardware.
Finally, the computer model described above had to be simplified somewhat. The process of fetching, decoding and executing is a fairly accurate representation of computing on a RISC architecture, like the PowerPC. On a CISC architecture, things are a bit more complex, since each instruction might execute a number of microinstructions, which in turn might access memory.