"cache-cache" -- Fr. Hide and seek
Motivation for yet-another-cache-entry
Above are fine nodes describing the detail about caches. This addition is to complement those: to supply extra bits that you'll never get asked in an exam (unless you're a student).
From where? Background ...
So, you're smart enough to read 'cache' as derived from the French "cacher". It's quite literal: a hidden stash of good stuff. I won't node the context like "IRA arms cache", just the computer architecture version. Not even the myriad of places that caches appear in computers. Just the particular location between the doing-it bits (I'll get more technical if need be later) that form part of the processor (CPU) and the storing-it bits you call memory (RAM).
Caching is a philosophy. Usenetters have been known to .sig with "Everything is an exercise in caching". Caches slot into the first of M.A. Jackson's rules of optimisation (Principles of Program Design, 1975) -- "Don't do it". That is, rather than spend the enormous cost of (in this case) going to the storing-it bits to find something out, like the value of a, you magically (well, not quite magic) have it to hand, right this instant (or at least in comparison to the shopping trip to memory).

Caches are, by their definition, hidden. They are transparent. You think that you are using memory. It looks and feels like you are using memory. The only difference is that memory looks and feels orders of magnitude faster. How?
See below. This is about history, etymology, and pompous things like that.
If you came here to compare what I've noded to the content of Hennessy & Patterson, then I hope you'll be surprised. If you already know the intimate details of a (deep breath) multi-block-line no-allocate-on-write-miss write-back victim-cache with critical-word-first line-fill, then I believe there's still plenty for you to find out.
Hennessy & Patterson, among other popular (read American) computer architects, cite Wilkes' paper of 1965 as the first paper describing caches. It was, by modern terminology, a direct-mapped cache. If necessary, I'll get onto the detail of something-something caches later. It's just detail; others cover it well enough above. Unstimulating implementation issues that you can find in any old school book.

More importantly, three years prior, a few people at National Cash Register (NCR) had, in fact, made and published (Proceedings on Giga-cycle Computing) a fully-associative cache. There is one technical detail regarding their implementation I'd like to point out, due to pure cuteness. It really sets the era. If these words make no sense now, ignore them. They are simply romantic longing for the dark ages. The replacement policy for the cache was real LRU (Least Recently Used). In modern cache implementations, this is rare, as is the presence of a fully associative cache. Why that is so is common copy in every architecture book: basically, expense. I'm diverging. It is how they implemented the LRU mechanism I love. Each cache line (OK, so the implementation has a single-word line length) had an analogue "how recently was this word used" metric. Basically, a battery, a capacitor (or something made of metal bits, I'm a s/w & conceptual computer architect) attached to a light-emitting diode (LED) that drained the charge over time. Each use of the word charged the battery. This meant you could physically see the cache doing its stuff. Oh, I love flashing lights.
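That charge-and-drain trick is just true LRU in analogue form. Here's a minimal software sketch of the same policy -- fully associative, one word per line, evict the least recently used -- with a made-up capacity, purely for illustration:

```python
from collections import OrderedDict

class FullyAssociativeLRU:
    """Toy fully-associative cache with true LRU replacement.

    Each entry holds one word, as in the NCR design; the capacity is
    arbitrary. The OrderedDict plays the role of the draining capacitors:
    moving an entry to the end on every use is the software version of
    recharging its battery.
    """
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.lines = OrderedDict()           # address -> present
        self.hits = self.misses = 0

    def access(self, address):
        if address in self.lines:
            self.hits += 1
            self.lines.move_to_end(address)   # "recharge" this line
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict the most-drained line
            self.lines[address] = True

cache = FullyAssociativeLRU(capacity=2)
for addr in [1, 2, 1, 3, 2]:
    cache.access(addr)
# 1 and 2 fill the cache, the second 1 hits, then 3 evicts 2 and 2 evicts 1
```

No flashing lights, sadly.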
Thus far, we've covered when caches happened -- at least the origin of their most common context. There are literally dozens of variations on the original schemes and their implementations. It's not interesting to know or understand the difference between each of them. Listing a couple of names for further reader research may be more use: victim caches (Jouppi); column-associative caches; stream buffers / burst buffers (McCarthy); set-associative caches. The two original cache devices (direct-mapped and fully associative) are variations on the generalisation that is set-associative. Interestingly enough, they are polar opposites in the general mechanism's range.
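To see how the two originals sit at opposite poles of the set-associative generalisation, here's a sketch with invented sizes: a cache of num_sets sets, each holding up to ways lines, LRU within a set. Setting ways=1 gives you direct-mapped; setting num_sets=1 gives you fully associative.

```python
class SetAssociativeCache:
    """Toy set-associative cache: num_sets sets of up to `ways` lines each,
    LRU replacement within a set, one word per line to keep it short."""
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]  # each set: LRU-first list
        self.hits = self.misses = 0

    def access(self, address):
        s = self.sets[address % self.num_sets]  # which set this address maps to
        if address in s:
            self.hits += 1
            s.remove(address)
            s.append(address)                   # most recently used at the end
        else:
            self.misses += 1
            if len(s) >= self.ways:
                s.pop(0)                        # evict the LRU line in this set
            s.append(address)

direct_mapped     = SetAssociativeCache(num_sets=8, ways=1)
fully_associative = SetAssociativeCache(num_sets=1, ways=8)
for addr in [0, 8, 0, 8]:   # 0 and 8 collide in an 8-set direct-mapped cache
    direct_mapped.access(addr)
    fully_associative.access(addr)
```

The ping-ponging addresses 0 and 8 evict each other endlessly in the direct-mapped configuration but coexist happily in the fully associative one -- which is exactly the conflict-miss problem victim caches were invented to patch up.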
Yet, so far, we've not given the etymology. This is a favourite. It goes like this:
If Jack Gibson (of IBM fame; he did things like "The Gibson Mix", which should make you think RISC) and friends had been big-headed enough to brow-beat the editor of the "IBM Systems Journal" in the mid '60s, this node would have been called "muffer". Hmmm. That would have aged as well as "bonk" from comics like The Beano and The Dandy.
Gibson and his pals at IBM were describing the IBM System/360 (model 85 to be exact, but that's superfluous). They'd put in a memory buffer -- hence muffer -- to store recent accesses to core memory. This was when IBM were wondering what to build main memory from: ICs -- yes, integrated circuits -- or the ferrite ring stuff. Rappa covers this transition well.
How's it work? What's it do?
We know when it was first described and named. We know why we have them (first rule of optimisation): it makes your machine appear much faster than a cursory look at the specifications of its memory system would have you believe. Do you care why and how it works? If you do, there's more to read.
The best thing a memory device (where you've stashed all your great data) could do for your processor (specifically the load part of the load/store unit) is to have it ready and waiting exactly when you need it. That is, at some time your little assembler LOAD instruction will be executed. It could take your memory chip, ooh, a hundred times as long as it takes to do the thing you're going to do with it. I dunno, like add something to it. If you could have started preparing the extraction of the datum ninety-nine clock cycles before the LOAD, then there's no unbearable stall while you wait for your precious item to arrive.
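That hundred-to-one gap is usually summarised as average memory access time: the cost of a hit, plus the miss rate times the penalty of the trip to memory. The numbers below are invented (a 1-cycle hit and a 100-cycle memory trip), but the shape of the arithmetic is the standard one:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles: hit cost plus the
    expected cost of the misses that go all the way to memory."""
    return hit_time + miss_rate * miss_penalty

# Invented but plausible numbers.
without_cache = amat(hit_time=0, miss_rate=1.0, miss_penalty=100)   # every access pays full fare
with_cache    = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)  # a 95% hit rate
```

Even with the extra cycle a hit costs you, a 95% hit rate turns a 100-cycle average into roughly a 6-cycle one. That's the whole sales pitch.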
So why don't we have memory systems or gadgets that do this? There are lots of things to talk about in computer architecture that aim to reduce the impact of memory transaction delay. Things like: pre-fetching or touch instructions; scheduling (multi-processing or multi-threading) on memory transactions; super-scalar architectures; posh memory controllers like Impulse (the University of Utah, Sally McKee et al.); indeed, just about everything in computer architecture for the last, oh, thirty years has been about this. Heck, even multi-processors, processors-in-memory and their ilk have been about putting the doing-it bits physically closer -- and hence with quicker access -- to the storing-it bits: memory.
We have caches because we can't afford fast memory. If you have a Cray, you'll have bucketloads of trapping-your-knickers-in-the-crack-of-your-@$$ SRAM -- really fast memory. Your performance problems will be very well understood and you'll have a bunch of highly qualified people working at getting your weather forecasted, your car crashing (simulated of course) and your high energy physics (that's nukes to you) going faster and faster.
Not everyone has such a beast of a computer. Still, devices like Impulse are aimed at making these behemoths go even faster.
We're trying to get the right bits of memory to the processor at the right time. Typically, the right time is now and the right bit of memory is not as predictable as we'd always like. That's the result of letting users pick the video or MP3 they'd like played. We do have one leverageable advantage: locality.
Locality comes in two main flavours (they're different perspectives on the measurement) called spatial and temporal. There are lots of sub-categories, but they refer to application source code and require a lot more discussion than necessary. Simply put, spatial locality is the absolute difference between the current required memory location (the physical address) and those that were requested previously. Temporal locality is the difference in time (you can re-time a clock to tick once every memory transaction) between successive references to the same physical memory location.

You can measure locality. Until recently (McKee et al., 1998) researchers used cosmological constants, picking their favourite Greek letter, calling it locality and then giving it some value that they thought fitting. It was all pretty poor -- from an analytical perspective, that is. There's talk about modelling later. Anyway, you can measure this stuff in a well-defined way that makes real sense. You can measure instantaneous locality (and instantaneous hit-rate) and plot it (them) on a graph to make nice curves (OK, not real curves, they're discrete, let it slide). Most computer applications exhibit a lot of exploitable locality. All locality is exploitable; it just costs more and more money (effort, chip area, etc.) to realise every little bit.
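Those two definitions are concrete enough to compute. A minimal sketch, with a made-up six-access address trace and the clock ticking once per memory transaction as described:

```python
def spatial_distances(trace):
    """Spatial locality: absolute address difference between each
    request and the one before it (small numbers = good locality)."""
    return [abs(b - a) for a, b in zip(trace, trace[1:])]

def temporal_gaps(trace):
    """Temporal locality: for each re-reference, how many transactions
    have passed since the same address was last touched."""
    last_seen, gaps = {}, []
    for t, addr in enumerate(trace):
        if addr in last_seen:
            gaps.append(t - last_seen[addr])
        last_seen[addr] = t
    return gaps

trace = [100, 101, 102, 100, 200, 100]   # invented physical addresses
strides = spatial_distances(trace)        # [1, 1, 2, 100, 100]
reuse   = temporal_gaps(trace)            # [3, 2] -- address 100 comes back twice
```

Plot those per-transaction and you get the discrete "curves" mentioned above: the instantaneous view of how friendly a program is being to its cache.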
Computer programs, in general (ooh, I hate that), tend to use the same bits of memory, or memory nearby, over and over again. It's due to the way programmers typically (there it is again) write iterative, compositional solutions to your every need. This offers an instant, lazy-hack alternative to the crystal-ball-requiring "go fetch in advance those memory bits the application needs". The implementation of caches simply keeps those things you've paid to load from memory around, along with their neighbours. That is, caches simply retain memory contents that you have already seen and those that are nearby (in memory address space terms). Thanks to the locality exhibited by computer applications, the spatial and temporal similarity in the memory requests means they can very frequently be supplied by the collection retained in the cache. For instance, in the SPEC benchmarks, cache hit-rates for every component in the benchmark suite are over 90% and remain so after the first second of execution.
That's it. That's all caches do. They keep the things you've asked for already along with some nearby items.
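The keep-it-and-its-neighbours trick can be shown in a few lines. A toy direct-mapped cache where each line holds several words -- sizes invented -- run over a plain sequential walk through an array:

```python
def hit_rate(trace, num_lines=4, words_per_line=4):
    """Toy direct-mapped cache with multi-word lines: fetching one word
    drags in its neighbours, which is how spatial locality gets exploited."""
    tags = [None] * num_lines
    hits = 0
    for addr in trace:
        line = addr // words_per_line   # which chunk of memory this word is in
        idx = line % num_lines          # which cache line that chunk maps to
        if tags[idx] == line:
            hits += 1
        else:
            tags[idx] = line            # miss: fill the whole line
    return hits / len(trace)

sequential = list(range(64))            # a simple walk along an array
rate = hit_rate(sequential)             # each line misses once, then 3 hits
```

One paid-for miss per line, three free hits after it: a 75% hit rate from nothing but the "nearby items" half of the trick, before temporal reuse even gets a look in.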
Caches are bad

Well, they are more of a problem than a sin. Relying on a cache to deliver performance is, unless you really know what you're doing, foolish. There's lots of research into making their performance predictable. Lots of maths and sometimes funky hardware.
Regular computer architecture proposals are about squeezing another half a percentage point out of the latest SPEC performance. All in all, not very exciting.
Caches also cost a lot. You'll find them eating up to two thirds of the die (that's the real estate on a chip) on modern microprocessors. This burns a criminal quantity of electricity (power) and generates tonnes of heat and radio racket (interference). Microprocessor designers find these qualities sufferable given the general performance enhancing kick they give.
It's pretty easy to write applications that perform badly, even if you have the greatest cache in the world. Pointer chasing is what kills caches. Recursive data types and object-oriented programming both result in pointer chasing. State machines (like emulators, virtual machines, bits of compilers) have poor cache profiles too. Not because they don't exhibit any locality, but because exploiting it often requires too big a cache or very long run-times. In fact, most short programs get little benefit from a cache. It's the implementation of caches -- the keep-what-we've-already-paid-for -- that doesn't work in these cases.
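Why pointer chasing hurts is easy to demonstrate with a toy direct-mapped, multi-word-line cache model and two invented address traces: a contiguous array walk versus a visit to list nodes scattered all over a pretend heap:

```python
import random

def hit_rate(trace, num_lines=8, words_per_line=4):
    """Toy direct-mapped cache with multi-word lines (sizes invented)."""
    tags = [None] * num_lines
    hits = 0
    for addr in trace:
        line = addr // words_per_line
        idx = line % num_lines
        if tags[idx] == line:
            hits += 1
        else:
            tags[idx] = line
    return hits / len(trace)

random.seed(0)
array_walk    = list(range(256))                    # contiguous data
pointer_chase = random.sample(range(1 << 20), 256)  # nodes strewn across the heap

# The array walk hits about 75% of the time; the pointer chase almost never
# hits, because no request lands near anything the cache has already kept.
```

Same number of accesses, same cache: the only difference is that the scattered trace has no neighbours worth keeping. That, in miniature, is what linked structures do to real caches.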
And, as a simple trailing question: I've never known of a Harvard (segregated instruction and data) cache beyond level one.