How in the world do those super complicated processors work? They keep getting faster and more complicated. Well, this is my attempt to shine a little light on one small aspect of advanced computer architecture.

Trace cache is a specialized type of cache memory developed fairly recently and used in some modern processors, most famously the Pentium 4. You may want to have some background on cache, memory hierarchy, and assembly language before tackling this node.

Normal Cache
Normal cache is basically a stepping stone between main memory and the CPU registers. The cache keeps copies of memory values close to the processor, so that if you keep using the same values over and over again they will be readily available. There are two basic types of cache, instruction cache and data cache, although cache structures show up all over the CPU in places you wouldn't expect (the BTB, the branch predictor, the TLB). Trace cache mainly deals with instructions, making it a type of instruction cache. (See cache, wu's for more information)

Basics
The big problem in processor design, outside of slow memory access, is branching. Assembly code runs one line after another, in sequential order, until a branch is taken; when that happens, execution jumps to a different part of the code.

ASSEMBLY CODE
(see x86 assembly for more info.)

address    instruction
------------------------
000F:0001   AND  EBX,ECX
000F:0002   DEC  EAX                           
000F:0003   CMP  EAX, EBX                      
000F:0004   JL   0007 >---------+              
000F:0005   PUSH EAX            |BRANCH        
000F:0006   JMP  0001           |TAKEN!        
000F:0007   PUSH EBX  <---------+              
000F:0008   XOR  EAX, EBX                      


At address 000F:0004 the code can either jump down to 0007
or keep going down to 0005, depending on the values of EAX
and EBX compared at line 0003.  Don't worry if you don't
understand the code; just understand that it can jump.

Processors try to get around this by predicting the branch. They use past history and a special 2-bit counter to guess which way the branch will go, so they know where execution is headed and can get ready for a jump (or no jump). Because a misprediction empties the pipeline, there is a large performance penalty whenever the processor guesses wrong and has pre-fetched the wrong instructions.
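To make the "special 2-bit counter" concrete, here is a minimal Python sketch of a 2-bit saturating counter predictor. The class name and starting state are illustrative choices, not from any particular processor; the key idea is that a single surprise outcome nudges the counter one step instead of flipping the prediction outright.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict "not taken",
    states 2-3 predict "taken". Each outcome moves one step."""

    def __init__(self):
        self.counter = 2  # start in the weakly "taken" state

    def predict(self):
        return self.counter >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

# A loop branch that is taken 9 times and then falls through once,
# like the backward jump in the code above:
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    if p.predict() == taken:
        correct += 1
    p.update(taken)   # train on what actually happened
print(correct, "of", len(outcomes), "predictions correct")  # 9 of 10
```

Notice that the lone fall-through at the end only costs one misprediction; a 1-bit scheme would mispredict again when the loop restarts.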

Okay, whew. Got all that? Now the actual trace cache.

Lots of work has been put in by lots of smart people to develop really good branch predictors. At some point, someone somewhere said, "Hey, let's use these branch predictor thingies to make ourselves a better instruction cache!" Conventional instruction caches go to memory and grab large chunks of consecutive instructions at a time, which works well when there are no jumps. A trace cache instead uses branch prediction to dynamically fetch instructions from memory according to the anticipated flow of the code. In other words, the cache looks at the predictions for upcoming branches, and if a branch is predicted taken, it fetches the instructions from the branch target ahead of time. This way the trace cache holds all the correct code, as long as the branch predictor was right. If the predictor was wrong, you have to stall and go back to memory for the correct instructions. So imagine we are running an extended version of the code above, after it has already looped once.

address    instruction
------------------------
000F:0001   AND  EBX,ECX <---------------------+
000F:0002   DEC  EAX                           |
000F:0003   CMP  EAX, EBX                      |
000F:0004   JL   0007 >---------+              |
000F:0005   PUSH EAX            |BRANCH        |
000F:0006   JMP  0001           |TAKEN!        |
000F:0007   PUSH EBX  <---------+              |
000F:0008   XOR  EAX, EBX                      |
000F:0009   ADD  EAX, 1                        |
000F:000A   PUSH EAX                           |
000F:000B   MOV  EAX,ESI                       |
000F:000C   JMP  0001  >-----------------------+

The flow of the code (by address) would be 1,2,3,4,7,8,9,A,B,C if the branch on line 4 is always taken. A typical cache would grab a chunk of memory in sequential order (i.e. 1,2,3,4,5); let's say the chunks are five instructions big. Now the Branch Target Buffer should have the correct prediction for 0004 (predict a jump), so the trace cache can load up the basic blocks {1,2,3,4,7} and {8,9,A,B,C}, which provide exactly the instructions the CPU will need. The normal instruction cache, however, would hold {1,2,3,4,5} and {6,7,8,9,A}, which is missing necessary instructions and is in the wrong order, generating a cache miss. Which is SLOW!
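The two fetch strategies above can be sketched in a few lines of Python. This is a toy model, not any real cache's fill logic: addresses are plain integers, and the predicted-taken branches from the listing are hard-coded in a dictionary.

```python
CHUNK = 5  # instructions per fetch, as in the example above

def sequential_fetch(start):
    # A conventional instruction cache grabs consecutive addresses.
    return [start + i for i in range(CHUNK)]

def trace_fetch(start, predictions):
    # A trace cache follows predicted-taken branches mid-fetch.
    trace, addr = [], start
    while len(trace) < CHUNK:
        trace.append(addr)
        addr = predictions.get(addr, addr + 1)  # jump if predicted taken
    return trace

# Branch at 0004 predicted taken to 0007; jump at 000C back to 0001.
predicted = {0x4: 0x7, 0xC: 0x1}

print(sequential_fetch(0x1))        # [1, 2, 3, 4, 5]
print(trace_fetch(0x1, predicted))  # [1, 2, 3, 4, 7]
print(trace_fetch(0x8, predicted))  # [8, 9, 10, 11, 12]  i.e. {8,9,A,B,C}
```

The sequential fetch wastes slots 5 and 6 on instructions that never execute, while the trace fetch delivers exactly the {1,2,3,4,7} and {8,9,A,B,C} blocks from the walkthrough.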

Variations
The trace cache is a simple concept (for computer architects, anyway), but it is difficult to implement and can be done in many different ways. Because of this there are many variations on the basic idea.

Most trace caches work with a basic block: a predefined number of instructions fetched from memory as a unit, like the braced groups in the example above. Make the block too big and there is a large penalty when the prediction is wrong; too small and there isn't enough benefit from predicting the instruction flow.

Many caches today will discard an entire block if the branch prediction is wrong, which can mean throwing away good code. If the block sequence 1,2,3 is predicted and 1,2,4 actually executes, you *should* still be able to use blocks 1 and 2. This is partial block usage, which some designs support.
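Partial block usage boils down to keeping the matching prefix instead of invalidating everything. A hedged sketch, with made-up block numbers rather than anything from a real design:

```python
def usable_prefix(predicted, actual):
    """Return the leading run of blocks where the predicted
    sequence and the actual execution path agree; everything
    from the first mismatch onward must be re-fetched."""
    usable = []
    for p, a in zip(predicted, actual):
        if p != a:
            break
        usable.append(p)
    return usable

# Predicted blocks 1,2,3 but the program actually ran 1,2,4:
print(usable_prefix([1, 2, 3], [1, 2, 4]))  # [1, 2]
```

A design without partial usage would return nothing here and pay the full re-fetch cost for blocks 1 and 2 as well.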

Another possibility used today is to give each basic block unique identity information. This identity is used, along with a buffer that tracks recent history, to predict which basic block will arrive next. That adds yet another level of prediction, creating a whole chain of instructions pre-fetched and ready to go.
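The identity-plus-history idea can be illustrated with a toy next-block predictor. This is purely a sketch of the concept: the class, the history depth, and the letter block IDs are all invented for illustration, and real hardware uses far more compact structures.

```python
from collections import defaultdict

class NextBlockPredictor:
    """Guess the next basic block from the last few block IDs seen."""

    def __init__(self, depth=2):
        self.depth = depth
        self.history = ()               # IDs of the most recent blocks
        self.table = defaultdict(dict)  # history -> {next_id: count}

    def predict(self):
        seen = self.table.get(self.history)
        if not seen:
            return None                 # no history match yet
        return max(seen, key=seen.get)  # most frequent successor

    def record(self, block_id):
        # Count block_id as a successor of the current history,
        # then slide the history window forward.
        counts = self.table[self.history]
        counts[block_id] = counts.get(block_id, 0) + 1
        self.history = (self.history + (block_id,))[-self.depth:]

p = NextBlockPredictor()
# Train on a loop whose blocks alternate A -> B -> A -> B ...
for block in ["A", "B", "A", "B", "A", "B"]:
    p.record(block)
print(p.predict())  # A
```

After a bit of training, the predictor can name the next block before any of its branches execute, which is what lets a trace cache chain several pre-fetched blocks together.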

Drawbacks
Anytime you add functionality to a processor, you make it bigger. A trace cache can add a lot of transistors to an already large part of the processor, taking up valuable real estate on the silicon wafer. The cost of making a processor rises steeply with its die size, so implementing a trace cache will make a processor more expensive (yet hopefully faster).

Also, we are adding more prediction to the processor. That works well when we predict correctly, but when we are wrong there is an added performance hit. A poorly implemented trace cache that misses a lot may actually be slower than a regular cache.
