In computer science, a small amount of storage used to store recently used elements of a larger amount of storage in the hope that the time to access a data element is reduced.

The two rules in computer science:
1) If too slow, use caching
2) For flexibility/utilization, use indirection

Cache is also the product name for an object-oriented database system produced by Intersystems Corp. It runs on Wintel, various flavors of Unix and more...it is notable for being fully object oriented, as opposed to an RDBMS. SQL users need not apply. :-)

Caches come in lots of lovely flavours. All of my nodes below apply to caches that sit between the CPU and the main memory, but the same sort of factors apply to other cache-types.

Where do we store stuff in the cache?

How do we write to the cache?
Intersystems Caché is:
  1. a post-relational database. Nobody seems to know what that means. It's better than the old-fashioned relational database model.
  2. The reincarnation of the mumps/M programming language
It has good support for sparse multidimensional arrays.

Intersystems Caché E2 Writeup, Copyright 2002 Frank Grimes.

This writeup is dedicated to the public domain. Do with it what you will. (For details, see http://creativecommons.org/licenses/publicdomain/ )

--Frank Grimes, 2007

Background

At the heart of most modern computers lies the central processing unit or CPU. It is quite often the fastest and most expensive part of the machine. The CPU operates on data and programs which reside in the computer's main memory chips. The CPU fetches programs from the main memory one instruction at a time and executes them. Quite often, the CPU will have to fetch additional information from the memory, such as data on which a calculation is to be performed. The CPU runs according to a clock. This clock sends out pulses. At each pulse, the CPU will go one step further in this fetch decode execute cycle.

From this very simple sketch, it should be clear that the communication with main memory needs to be very fast: it would be unacceptable if it took a CPU only one clock cycle to execute an instruction, but ten clockcycles to fetch the instruction from main memory. It would be possible to construct memory in such a manner that its access time is equal to that of the CPU's internal registers. This way, a fetch would take only one clock cycle. Unfortunately, such memory is very expensive. For reasons beyond the scope of this text, it would in addition be very complicated, if not impossible, to build a large memory like this. That is why normal computers use slow, cheap dynamic RAM for memory. This RAM is in fact about a factor ten slower than the CPU and it is the biggest bottleneck in modern computers. When buying a computer, it would thus be wise to check out its bus speed instead of the CPU speed.

Principle

To soften the problem outlined above, processors are equipped with a very small amount of expensive on-chip memory, called a cache. The word cache (pronounced 'cash') derives from the French word 'cacher' (to conceal). The cache effectively hides the slower main memory from the CPU. The cache keeps copies of small pieces of data that reside in the main memory. When the CPU requests the data in a certain memory location, the hardware first checks if a copy resides in the cache. If so, the cached data is used. If not, the hardware loads a region of memory into a slot in the cache.

Why cache improves speed

Computer scientists have observed long ago that the memory access behaviour of computer programs is far from random. The CPU does not request values from all over the memory. Instead, there are always some hot spots, some parts of memory that the CPU just keeps accessing repeatedly. This is not surprising since computers are programmed by people that like to organise data and computations into lists, variables, procedures, loops, etc. A program usually executes the same procedures repeatedly, accesses the same variables repeatedly, works its way through lists and pieces of code, etc. This phenomenon is known as locality of reference. In a sentence: locality of reference is the observed statistical property that a program is most likely to request data equal to or close to previously requested data.

With the property of locality of reference in mind, it makes sense to cache an entire region of memory when a location from that region is requested. With luck, the processor will access many locations in this region in the future. Since the cache memory is very fast, this strategy can increase the effective throughput of the CPU considerably.

Caching policy issues

Write Through versus Write Back

Obviously, when the CPU changes data in a cached region, it needs to be written back to main memory at some point in time. There are two policies that can be followed here. They are write back and write through. With write back, a region is marked as 'dirty' once its data have been altered. When the cache is full and regions need to be evicted, the region is written back to memory. If it was not dirty, it is just discarded, since the main memory holds the same copy.With write through, the region is written to memory as soon as it is altered. The former policy holds the best performance increase, since a region may be written many times over for the cost of only one complete memory transfer, whereas the latter costs one complete region transfer per write. In some situations, like multiprocessor environments or situations where consistency and robustness are issues, it might be necessary to write through.

Region replacement

When the cache memory fills up, regions need to be evicted to make room for new ones. The question arises which strategy should be used in order to make the best use of the cache memory. The problem is essentially the same as that of page replacement in a virtual memory system. Belady stated an algorithm in 1966 which Mattson proved to be optimal in 1970. That is, Belady's algorithm causes the least amount of cache misses of all conceivable algorithms. (A cache miss is when a piece of requested data is not in the cache and has to be fetched from memory. The opposite is called a cache hit.) The algorithm is an obscenely simple one: "evict the region that will not be used for the longest period of time". Unfortunately, this algorithm is not causal. It requires knowledge of future events, which makes it impossible to implement (at the time of writing). Many algorithms exist that approach the optimal algorithm, like the Least Recently Used (LRU) algorithm, which evicts the least recently used region, or the First In First Out (FIFO) algorithm and many others. Preferably, hardware designers would use LRU, because it is statistically the closest approximation of the optimal algorithm. Unfortunately, implementing LRU is rather expensive and a compromise is often chosen. To see why, read on!

Mapping

When a region is brought into the cache, there are two conflicting factors involved when choosing a place to store the region. Those factors are efficiency and money. First of all, we could have a fully associative mapping. This means that the region can be placed anywhere in the cache. This would permit an implementation of the LRU algorithm. However, such a cache would make it somewhat difficult to find a specific piece of data within the cache. Since any requested address can be anywhere in the cache, the hardware required to translate linear addresses requested by the CPU to cache locations needs to be fairly sophisticated, read: expensive.

Another strategy is to use direct mapping. In this scheme, each memory region can be in at most one location in the cache. Another way to look at this is that certain memory regions share the same slot in the cache. So if a certain region is in the cache, we always know exaclty where it is, since it has only one place to go. This makes for simple hardware, but it is extremely inefficient. When two regions share the same cache slot they have to be continually swapped in and out if the CPU requires both at the same time. This resembles the thrashing phenomenon found in virtual memory systems that employ a local replacement policy, which was described by Denning in 1968. It is especially unnecessary if the rest of the cache is fully available.

A compromise can be reached in the form of a set-associative mapping. In this scheme, each region has a limited amount (a 'set') of cache regions in which it can reside. The hardware thus doesn't have to look through the entire cache, because it knows there are just a few locations where the sought after region can reside. The first few bits of the absolute memory address are often used as a means to group regions. They are called the tag bits. This way, each chunk of memory maps onto its own set of cache regions, but within that region the mapping is fully associative.

Remark that both direct mapping and set-associative mapping inhibit the implementation of a full LRU replacement strategy. They are cheaper to implement however. An interesting way to look at this is that a direct mapping allows no policy at all. The replacement algorithm is trivial since each requested region has only one way to go. Thus, direct mapping requires no special policy hardware. It also implies one of the worst possible policies.

Region size

Another important design consideration for cache memories is the size of the cached regions. Again, many factors conflict. It is desirable to have a large region size because it increases the chances of another hit on that region. Large regions also make the controlling hardware considerably easier and cheaper. On the other hand, since cache space is limited, a small cache would be more desirable. Further, transferring regions to and from memory takes time; the longer the region, the more time it takes. In practice, finding a good region size is very much an empirical matter. The size of the cache itself is an economical one.

Leftovers

Multiple level caching

Effectively, a computer consists of layer upon layer of caches: the memory caches disk blocks, the on-chip cache caches memory, the registers cache the on-chip cache. There is really no limitation to add yet another layer of cache somewhere. For instance, between the on-chip cache and the main memory we could have another layer of memory that is faster and more expensive than main memory, but cheaper and slower than the on-chip cache. This too comes at an obvious price, but this kind of level 2 caching is quite common. So is level 3 caching. There are some natural limits however. At some point, building a large, fast memory will be cheaper than building dozens of layers of cache.

Translation Look-aside Buffers

In a virtual memory system, which will not be described in this node, a special kind of cache is used in conjunction with the normal cache described above. In a virtual memory system it is necessary to translate virtual addresses to physical ones. It is also necessary to check if a physical address is actually in memory or if it is residing on the hard disk. Because this translation takes time (clock cycles) it pays off to keep the information on recently accessed memory pages in a special cache: the Translation Look-aside Buffer or TLB. On exactly the same grounds of locality of reference, the CPU is likely to need access to the same page in the future as it has just accessed. This is why the page descriptor of such a page is kept in the TLB.

Some typical figures

If you have carefully read the above, you will have noticed that there is much talk about probability and little about certainty. This needs to be stressed: in most of the cases, caching increases speed and in most of the cases, LRU works best. However, these are not laws of nature! There are programs for which LRU performs very poorly, especially when traversing long lists, and there are situations in which caching is slower than direct memory access. These situations do occur in the field so to speak. Be warned. That aside: your average application will have just about 90 percent of cache hits for every memory reference on your average computer. Pretty good! The same holds for the memory by the way: every memory reference that the cache makes in a virtual memory system usually has about 90 percent of hits. The chance of going to the disk is thus very dim. In a properly equipped system that is. This is just to give an impression of course. Your mileage will definitely vary.

Remarks

Caching, and the problems involved, is very similar to the mechanics of paging in a virtual memory system and to the mechanics of block access on disks. In fact, the main memory itself is used for caching disk blocks by the disk driver, since disks are much slower than main memory. Each has its own problems and peculiarities however, but generally, this type of managing limitations in speed and capacity is what makes designing hardware and operating system software so complex, and challenging perhaps. Problems like these are found in operating systems on many, many levels. The manner in which the implementer deals with them makes up for a great deal of the speed and efficiency with which your machine operates. It also makes it very difficult do compare, say a Macintosh with a G4 processor running MacOS X to a PC with an Athlon running Linux.

Another thing to note about the LRU replacement algorithm described above, is that it requires the hardware to keep statistics on each region. It requires continuous knowledge of when each region was used. This causes considerable overhead. Many times a simple approximation is used that will only track the usage of each region for one to four cache misses. Still, updating all the statistics requires complex hardware.

Finally, the computer model described above had to be simplified somewhat. The process of fetching, decoding and executing is a fairly accurate representation of computing on a RISC architecture, like the PowerPC. On a CISC architecture, things are a bit more complex, since each instruction might execute a number of microinstructions, which in turn might access memory.

Per the OED:

"Fr. cache, f. cacher to hide".

A hidey-hole; a supply. A set of things stored for later use.


Is one supposed to have the items in one's cache?

Per M. M. Kaye's The Ordinary Princess: a cache of apples in one's closet. Such a cache, if you happen to be a princess, and thus almost certainly not living in the kitchen outside such a fairy tale, would not be thought proper or ladylike. Neither would sitting on the windowsill in your petticoat, eating an apple from such a cache, or throwing the gnawed core out said window at some completely innocent rabbits whom you would think a princess would at least consider cute.

O, no. No, no.

A set of things outside oneself. A set of extraordinary items. A princess might have apricots and coconut, oranges in midsummer and strawberries in fall. Apples? Common apples?


Is the cache then hidden, furtive?

"a. A hiding place, esp. of goods, treasure, etc."

Treasure. What kind of map is there to such treasure?

All you scoundrels sailing your tropical waters. All you chained captives, carrying chests inland, following a path hacked through by machetes. All you captains marking tanned hides or parchment with India ink in dotted lines, setting your bootprints backward to hide the trail, setting delicate snares along the path. All you sets of three brothers, small children, each with a tattoo thick across your back. Here are your clues. Can you put them together?

You are yourself stolen. You, bound and gagged, at the tip of the plank. Down under the water, you are a cache of bones. They won't come back for you this time.


What other type of items might be in such a cache?

"b. esp. A hole or mound made by American pioneers and Arctic explorers to hide stores of provisions, ammunition, etc."

So. You watch for the redcoats, do you.

Hello Lexington, hello Concord. Gunpowder and flintlock, minutemen, late night muster at the Old North Bridge.

No one will get through you. There will be no one rifling through the storehouse, setting fire to it from a safe distance. No vast explosion in the early morning. No heat waves coming off the shattered building, no scorch marks on the cobblestones. No barrels taken, no muskets stolen.

It is quiet, for the moment. You keep a close watch. Wrapped up in your own icy breath at the side of the road, hidden in the hedge, behind a stand of trees. Are those footsteps off in the distance?

"cache-cahce" -- Fr. Hide and seek

Motivation for yet-another-cache-entry

Above are fine nodes describing the detail, implementation, book-stuff about caches. This addition is to complement those. To supply extra bits that you'll never get asked in an exam (unless you're a student of mine).

From where? Background ...

So, you're smart enough to read 'cache' as derived from the French verb cacher "to hide". It's quite literal. A hidden stash of good-stuff. I won't node the context like "IRA Arms cache", just the computer architecture version. Not even the myriad of places that caches appear in computers. Just the particular location between the doing-it bits (I'll get more technical if need be later) that form part of the processor (CPU) and the storing-it bits you call memory (RAM).

Caching is a philosophy. Usenetters' have been known to .sig with "Everything is an exercise in caching". Caches slot into the first of the rules of optimisation of M.A. Jackson (Principles of Program design 1975) -- "Don't do it". That is, rather than spend the enormous cost of (in this case) going to the storing-it bits to find something out, like the value of a, you magically (well, not quite magic) have it, to hand, right this instant (or at least in comparison to the shopping trip to memory). Caches are, by their definition, hidden. They are transparent. You think that you are using memory. It looks and feels like you are using memory. The only difference is that memory looks and feels orders of magnitude faster. How? See below. This is about history, etymology, and things pomp like that.

If you came here to compare what I've noded to the content of Hennesy & Patterson, then I hope you'll be surprised. If you already know the intimate details of a (deep breath) multi-block line no-allocate-on-write-miss write-back victim-cache with critical-word-first-line-fill then I believe there's still plenty for you to find out.

Hennessy & Patterson, among other popular (read American) computer architects, cite Wilkes' paper of 1965 as the first paper describing caches. It was, by modern terminology, a direct-mapped cache. If necessary, I'll get onto the detail of something-something cache later. It's just detail, others cover it well enough above. Unstimulating implementation issues that you can find in any old school book. More importantly, three years prior, a few people at National Cash Register (NCR) had, in fact, made and published (Proceedings on Giga-cycle computing) a fully-associative cache. There is one technical detail regarding their implementation I'd like to point out, due to pure cuteness. It really sets the era. If these words make no sense now, ignore them. They are simply romantic longing for the dark ages. The replacement policy for the cache was real LRU (Least Recently Used). In modern cache implementations, this is rare, as is the presence of a fully associative cache. Why that is so, is common copy in every architecture book. Basically expense. I'm diverging. It is how they implemented the LRU mechanism I love. Each cache line (OK, so the implementation has a single word line length) had an analogue "How recently was this word used" metric. Basically, a battery, a capacitor (or something made of metal bits, I'm a s/w & conceptual computer architect) attached to a light-emitting diode (LED) that drained the charge over time. Each use of the word charged the battery. This meant you could physically see the cache doing it's stuff. Oh, I love flashing lights.

Thus far, we've covered when caches happened. At least in the context of the beginning of their common context's origin. There are literally dozens of variations on the original schemes -- or implementation. It's not interesting to know or understand the difference between each of them. To list a couple of names for further reader research may be more use: Victim caches (Jouppi); Column-associative caches; Stream buffers / Burst-Buffers (McCarthy); Set-associative caches. The two original cache devices (direct-mapped and fully associative) are variations on the generalisation that is set-associative. Interestingly enough, they are polar opposites in the general mechanism's range.

Yet, so far, we've not given the etymology. This is a favourite. It goes like this:

If Jack Gibson (of IBM fame, did things like "The Gibson Mix" that should make you think RISC) and friends had been too big headed and brow-beat the editor of the "IBM Systems Journal" in the mid '60s, this node would have been called "muffer". Hmmm. That would have aged as well as "bonk" from comics like The Beano and The Dandy.

Gibson and his pals at IBM were describing the IBM System 360 (model 85 to be exact, that's just superfluous). They'd put a memory buffer -- hence muffer -- to store recent accesses to core memory. This was when IBM were wondering what to use for core memory, using ICs -- yes, integrated circuits -- or the ferrite ring stuff. Rappa covers this transition well.

How's it work? What's it do?

We know when it was first described and named. We know why we have them (first rule of optimisation). It makes your machine appear much faster than a cursory look at the specifications of it's memory system would have you believe. Do you care why and how it works? If you do, there's more to read.

The best thing a memory device (where you're stashed all your great data) could do for your processor (specifically the load part of the load/store unit) is to have it ready and waiting exactly when you need it. That is, at some time your little assembler LOAD instruction will be executed. It could take your memory chip, ooh, a hundred times as long as it takes to do the thing you're going to do with it. I dunno, like add something to it. If you could have started preparing the extraction of the datum ninety-nine clock-cycles previous to the LOAD then there's no unbearable stall while you wait for your precious item to arrive.

So why don't we have memory systems or gadgets that do this? There are lots of things to talk about in computer architecture that aim to reduce the impact of memory transaction delay. Things like: pre-fetching or touch instructions; scheduling (multi-processing or multi-threading) on memory transactions; super-scalar architectures; posh memory controllers like Impulse (UTAH University, Sally Mackee et al.) and just about most things in computer architecture for the last, oh, thirty years have been about this. Heck, even multi-processors, processors-in-memory and their ilk have been about putting the doing-it-bits physically closer -- and hence with quicker access to -- storing-it-bits: memory.

We have caches because we can't afford fast memory. If you have a Cray, you'll have bucketloads of trapping-your-knickers-in-the-crack-of-your-@$$ SRAM -- really fast memory. Your performance problems will be very well understood and you'll have a bunch of highly qualified people working at getting your weather forecasted, your car crashing (simulated of course) and your high energy physics (that's nukes to you) going faster and faster. Not everyone has such a beast of a computer. Still, devices like Impulse are aimed at making these behemoths go even faster.

We're trying to get the right bits of memory to the processor at the right time. Typically, the right time is now and the right bit of memory is not as predictable as we'd always like. That's the result of letting users pick the video or MP3 they'd like played. We do have one, leveragable advantage: locality. Locality comes in two main flavours (they're different perspectives on the measurement) called spatial and temporal. There are lots of sub-categories, but they refer to application source code and require a lot more discussion than necessary. Simply put, spatial-locality is the absolute difference between the current required memory location (the physical address) and those that were requested previously. Temporal locality is the difference in time (you can re-time a clock to tick once every memory transaction) between successive references to the same physical memory location. You can measure locality. Until recently (MacKee et al. 1998) researchers used cosmological constants, picking their favourite Greek letter, calling it locality and then giving it some value that they thought fitting. It was all pretty poor. From an analytical perspective that is. There's talk about modelling later. Anyway, you can measure this stuff in a well defined way, that makes real sense. You can measure instantaneous-locality (and instantaneous-hit-rate) and plot it (them) on a graph to make a nice curves (OK, not real curves, they're discrete, let it slide). Most computer applications exhibit a lot of exploitable locality. All locality is exploitable, it just costs more and more money (effort, chip area, etc.) to realise every little bit.

Computer programs, in general (ooh, I hate that), tend to use the same bits of memory or memory nearby over and over again. It's due to the way programmers typically (there it is again) write iterative, compositional solutions to your every need. This offers the instant, lazy-hack solution to the crystal-ball requiring go "fetch in advance those memory bits the application needs". The implementation of caches simply keeps those things you've paid to load from memory around along with it's neighbours. That is, caches simply retain memory contents that you have already seen and those that are nearby (in memory address space terms). Using the phenomenon of exhibited locality in computer applications, the spatial and temporal similarity in the memory requests can very frequently be supplied by the collection retained in the cache. For instance, in the SPEC benchmarks, cache hit-rates for every component in the benchmark suite are over 90% and remain so after the first second of execution. That's it. That's all caches do. They keep the things you've asked for already along with some nearby items.

Bonus material

Caches are bad. Well, they are more of a problem than a sin. Relying on a cache to deliver performance is, unless you really know what you're doing, foolish. There's lots of research into making their performance predictable. Lots of maths and sometimes funky hardware and software.

Regular computer architecture proposals are about squeezing another half a percentage point out of the latest SPEC performance. All in all, not very exciting.

Caches also cost a lot. You'll find them eating up to two thirds of the die (that's the real estate on a chip) on modern microprocessors. This burns a criminal quantity of electricity (power) and generates tonnes of heat and radio racket (interference). Microprocessor designers find these qualities sufferable given the general performance enhancing kick they give.

It's pretty easy to write applications that perform badly, even if you have the greatest cache in the world. Pointer chasing is what kills caches. Recursive data types and object oriented programming both result in pointer chasing. State machines (like emulators, virtual machines, bits of compilers) have poor cache profiles too. Not because they don't exhibit any locality, but because exploiting it often requires too big a cache or very long run-times. In fact, most short programs get little benefit from a cache. It's the implementation of caches -- the keep what we've already paid for -- that doesn't work in these case.

And, as a simple trailer quest, I've never known of a Harvard (segregated instruction and data) cache beyond level one.

Cache (?), n. [F., a hiding place, fr. cacher to conceal, to hide.]

A hole in the ground, or hiding place, for concealing and preserving provisions which it is inconvenient to carry.

Kane.

 

© Webster 1913.

Log in or registerto write something here or to contact authors.