In a CPU, the pipeline is another name for the execution path. Essentially, it is a list of stages through which the processor goes to get something done. Deeper pipelining allows you to make a faster CPU, but if you make a bad branch prediction (Related to caching) then a deeper pipeline means a greater penalty for the mispredicted instruction. Other instructions in the pipeline march along (with potential mispredictions themselves) and the mispredicted instruction is inserted at the head of the queue. The later your branch check is in your pipeline, and it must be at or near the end, the more cycles you're going to have to wait to get back the right answer after a misprediction.
If you want to increase your clock rate, and thus the number of operations executed per second (or other unit of time) you have two choices. Either you make your operations take less time, or you increase the number of stages in your pipeline. Either way, each stage of your pipeline must take one cycle to complete. Hence, breaking up an operation into two (or more) stages and adding steps to your pipeline allows you to increase the speed of your CPU, with the caveat that bad branch prediction costs you more with a deeper pipeline. Making a stage accomplish less makes it take less time - this seems straightforward, but why? It's because transistors have a switching time, an amount of time which it takes them to change state. The more transistors you have leading into other transistors, the longer it will take them to all fall into line.
The classic four-stage pipeline taught in processor design courses goes something like this:
+-------+
| Cache |
+-------+
|
V
+-----------------+
| Front End |
| (Decode) |
+-----------------+
|
V
+---------------------------+
| Execute |
| +-----+ +-----+ +-----+ |
| | Ex. | | Ex. | | Ex. | | <- Execution Units
| +-----+ +-----+ +-----+ |
+---------------------------+
|
V
+---------+
| Retire |
+---------+
You fetch the instruction from the cache, decode it so that the internals of the processor can do something about it, execute the operation (actually do the work) and then write the results somewhere. Even in a simple CPU like this you can have multiple instructions in the CPU at once, each one going through a different stage of execution.
The most important thing to understand when thinking about pipelines is that one stage of the pipeline is completed per cycle. In a pipelined processor you don't just take in an instruction which requires two cycles on the first cycle and get the answer back on the third; In fact, you're not going to get it out any sooner than the twenty-third on a Pentium 4. This is less important than you might think since the P4s are closing in on three gigahertz these days, and the CPU does handle putting things back in order if they become disordered -- and they will. The point is that each stage can take no longer than one clock cycle to complete or you are going to slow down the entire queue as operations bunch up.
P4 PIPELINE
The Pentium 4's pipeline is interesting because it is so very long; The P6 core (Which lives on essentially though not literally unchanged in the Pentium Pro, Pentium 2, and Pentium 3) uses a ten stage pipeline, while the Pentium 4 uses a twenty stage pipeline. This is why we have near-three gigahertz pentium 4s now, but it's also why a P4 is so much less efficient than a Pentium 3, clock for clock.
The P4's pipeline looks like this:
- Trace Cache next Instruction Pointer
- Trace Cache next Instruction Pointer
- Trace Cache Fetch
- Trace Cache Fetch
- Drive
- Allocate and Rename
- Allocate and Rename
- Allocate and Rename
- Queue
- Schedule
- Schedule
- Schedule
- Dispatch
- Dispatch
- Register Files
- Register Files
- Execute
- Flags
- Branch Check
- Drive
Wow, that sure does look impressive. What the heck does it all mean?
- Trace Cache next Instruction Pointer: The trace cache fetch logic gets a pointer to the next instruction in the trace cache. Trace cache is intel's name for putting the L1 cache inside of the first functional unit (Where the decode step is in the four-stage model) for speed.
- Trace Cache Fetch: We use our pointer to fetch an instruction from the cache.
- Drive. In this stage we are wasting time while signals propagate throughout the CPU. This is analogous to not calling your insurance company to see if they have your payment before the mail has time to get to them and get opened up. This is apparently necessary when you start getting into serious multiple-gigahertz speeds.
- Allocate and Rename: The CPU actually contains more registers than are related in the specification in order to speed things up and be able to execute operations in a superscalar fashion (which is to say, more than one operation at once.) At this time, the CPU will associate different registers with the names of the registers.
- Queue: Operations are now placed into either the memory queue or the arithmetic (everything else) queue for scheduling.
- Schedule: In a superscalar processor, operations are often executed out of order so that they do not step on each other and so that they are completed as rapidly as possible. The P4 has four queues; Memory, Fast ALU, Slow ALU/General FPU, and Simple FP. All instructions get dumped into one of them for later execution by the appropriate (and linked) functional unit.
Operations in each queue are sorted based on the order they were submitted, what instructions are waiting on them, and the number of cycles required by an ALU (or FPU, or LSU) to complete the instruction.
- Dispatch: Instructions are moved from the queues to the functional units.
- Register Files: The instructions are now loaded into the functional units (think rank and file, not file folder for the meaning of "file" here) for actual execution.
- Execute: The functional units process the instructions in the files along with data in the registers. This is the seventeenth stage! A lot has happened before we got here.
- Flags: There is a status register (sometimes called a flag register) in all CPUs of the x86 family (and most others) which is used for conditional jumps. The flags are set in this stage.
- Branch Check: Now in the nineteenth out of twenty stages we finally check to see if the branch predictor predicted incorrectly and we have to discard some operation we have just spent eighteen stages (and a cycle) on.
In other words, the Pentium 4 is split up into 20 very short pipeline stages. Some of them are so short that they aren't long enough to fit an entire function, and so that function actually takes two stages. Chopping execution up into so many short stages means that very high clock rates can be achieved, but if your branch prediction is off (IE, if what you grab out of cache isn't really what you wanted) then that operation is now going to take you another twenty cycles to execute, after already having taken you 19 cycles to get to the branch check.
References:
- http://arstechnica.com/cpu/3q99/k7_theory/k7-one-1.html
- http://www.anandtech.com/showdoc.html?i=1360&p=3
- http://arstechnica.com/cpu/01q2/p4andg4e/p4andg4e-6.html