The "classic" control scheme for out-of-order superscalar processor micro-architectures, Tomasulo's Algorithm provides a framework for organising and coordinating the simultaneous execution of independent instructions on multiple execution units, scheduling dynamically around data hazards, structural hazards such as busy multi-cycle pipelines, and variable-latency events such as cache misses.

Many modern microprocessors are designed around variations on Tomasulo's Algorithm, perhaps most notably Intel's P6 core (of which Andy Glew was one of the principal architects), found in the Pentium Pro, Pentium II, Pentium III and Pentium M. Other implementations include most PowerPC processors and the DEC Alpha 21264; the first few generations of the Alpha, by contrast, were famously strict in-order designs.

The first implementation of Tomasulo's algorithm was the floating point unit of the IBM System/360 Model 91. The earlier CDC 6600 is often mentioned in the same breath, but it used scoreboarding, a related yet distinct dynamic scheduling technique; both were mainframe-class machines.

The Problem

A microprocessor will have a variety of different functional units, corresponding roughly to the variety of different instructions found in the instruction set architecture it implements: memory accesses, simple integer arithmetic, multiplication, floating point arithmetic and so on. In a simple, "scalar" pipelined microprocessor, only one instruction is actually executed at a time (although others may be in the process of being fetched from instruction memory, having operands fetched from registers, and so on), and so most of the actual functional units will be idle for most of the time. Furthermore, where functional units take more than a single cycle to compute their results, they can potentially hold up following instructions unnecessarily.

So, it's desirable to make use of these idle units and execute more than one instruction at the same time to improve performance, without adding the cost of any more execution units. So long as successive instructions don't rely on each other's results too often (i.e. there is a sufficient amount of instruction level parallelism), we should see a reasonable improvement in performance.

This all sounds very reasonable so far, and is in fact the key idea behind superscalar processors. The trouble is that providing the ability for multiple instructions to execute at once, in a manner that actually allows simultaneous execution to occur a useful proportion of the time, turns out to be really quite complex.

Operand dependencies between instructions restrict which instructions can execute at the same time. Instructions which take more than one cycle to execute (for example, multiplications, memory accesses and floating point operations) make things far more complex to organise, since they can affect scheduling in more than one way: resolving a read-after-write dependency requires stalling the second instruction until the first produces its result, but a write-after-read dependency requires only that the first instruction read its operand before the second overwrites it.

There are a multitude of issues to contend with, so much so that attempting to arrange the control of a superscalar processor in an ad-hoc manner is fraught with danger: the result is almost certain to be subtly buggy and difficult to program efficiently, and so largely doomed to failure in the commercial world.

Tomasulo's Algorithm provides a robust, correct algorithm for the superscalar execution of a sequential instruction stream, which can be realistically implemented in hardware, and which can expose a reasonable and useful amount of instruction level parallelism from a program.

The Algorithm

The basis of the algorithm is, in a quick summary form: instructions are "issued" (a process which also involves register renaming) to "reservation stations" attached to functional units, in program order; the reservation stations decide when to execute the instructions based on the availability of their operands, and broadcast the results of the operations to all the other functional units as they are calculated.

So, let's follow the life of an instruction through the stages of a typical Tomasulo's processor. Note that each "stage" here may correspond to more than one pipeline stage:


Fetch

Just like in a simple scalar processor, instructions are fetched from instruction memory. Unlike a simple scalar processor, instructions are likely to be fetched in batches of two or more at a time.


Issue

This is where things get interesting. In the issue stage, the processor works out which reservation station to issue the instruction to, based on what type of functional unit it requires, and the availability of spaces in the reservation stations.

Instructions can be issued to reservation stations regardless of whether or not their operands are available. If an operand isn't available to be read from the register file immediately, this must be because the value associated with that register has not yet been calculated. If this is the case, then it will be written with the result of an instruction which has already been issued, and is therefore assigned to a reservation station.

Hence, as the issue unit issues the instruction to its reservation station, it renames any references to outstanding registers with an identifier tag which indicates which functional unit will produce the result, and which virtual register identifier the result will be identified as. The issue unit also renames any result registers associated with the new instruction to a virtual register identifier so that it can tell subsequent instructions where to find the results of this instruction.


Execute

The execution stage of a processor implementing Tomasulo's algorithm consists of a number of functional units, each with its own reservation station. The reservation station holds a small number of instructions which have been issued, but cannot yet be executed. When an instruction becomes ready to be executed because of operand values becoming available (by being broadcast on the common data bus... more on that later), and the functional unit is ready to accept a new instruction, the reservation station passes the instruction to the functional unit, where the instruction's real execution takes place: arithmetic operations are calculated, memory is accessed. You know, the real work.

Writeback

The final stage an instruction goes through is writeback. This is similar in many ways to a simple pipelined machine: when the result of an instruction has been calculated, the value is driven on one of a number of common data buses, to be sent to the register file. Much like a bypass, this bus is also monitored by other parts of the machine so that the value may be used immediately by waiting instructions.


An Example

Let's try a small example now to see how this works in practice. We'll use a very simple bit of code, which computes the factorial of its input:

int factorial(int x) {
		int f = 1;
loop:
		f = f * x;
		x = x - 1;
		if (x >= 2) goto loop;
		return f;
}

The above is C code, but extremely simple C, such that each statement will likely correspond to a single machine instruction, much like GIMPLE.

For the purposes of keeping the example simple, we'll assume that our processor has the bare minimum set of functional units necessary to compute this function: a multiplier, and an arithmetic unit capable of subtraction and comparison (which is, after all, just subtraction). We'll assume a realistic set of times for these units, too: the arithmetic unit will take a single cycle, whereas the multiplier will take three (which is a reasonable figure for most modern processors). We'll also assume, to keep things nice and simple, that we have an ideal branch predictor which will always fetch the correct instruction. To keep things nice and brief, we'll set the initial value of 'x' to 3, implying two iterations of the loop.

First, let's try executing this on a processor using a simple scalar control model, with the execution units grouped together into a single 'Execute' stage. The pipeline looks a little like this:

        Fetch -> Issue -> Execute -> Writeback

and so as we execute the code, the following instructions are in each pipeline stage in each cycle:

Cycle   Issue unit           Execute unit
1       f=1;              
2       f=f*x;               f=1;
3       x=x-1;               f=1*3; {1}
4       x=x-1;               f=1*3; {2}
5       x=x-1;               f=1*3; {3}
6       if (x>=2) goto loop; x=3-1;
7       f=f*x;               if (2>=2) goto loop;
8       x=x-1;               f=3*2; {1}
9       x=x-1;               f=3*2; {2}
10      x=x-1;               f=3*2; {3}
11      if (x>=2) goto loop; x=2-1;
12      return f;            if (1>=2) goto loop;
13                           return 6;

So, we can see that it took 13 cycles to compute that factorial. During those 13 cycles, the multiplier is busy for only 6 of them, less than half the time, despite holding up following instructions for 4 of them. Let's try Tomasulo's algorithm instead. To make it a fair comparison, we'll keep the same fetch unit, and so we'll still issue only a single instruction per cycle. We rearrange the pipeline slightly to look like this:

                        .-> Arithmetic -.             
        Fetch -> Issue -|               |-> Writeback 
                        '-> Multiplier -'             

Note that we haven't added any new execution resources: we've just rearranged the pipeline around the execution units that were there already. Using Tomasulo's Algorithm, our new pipeline will execute the factorial code as follows:

Cycle  Issue unit           Arithmetic unit      Multiplier          Writeback
1      f=1;
2      f=f*x;               f=1;
3      x=x-1;                                    f=1*3; {1}          f=1;
4      if (x>=2) goto loop; x=3-1;               f=1*3; {2}
5      f=f*x;               if (2>=2) goto loop; f=1*3; {3}          x=2;
6      x=x-1;                                    f=3*2; {1}          f=3;
7      if (x>=2) goto loop; x=2-1;               f=3*2; {2}
8      return f;            if (1>=2);           f=3*2; {3}          x=1;
9                           return 6;                                f=6;

That's a lot more like it. We've taken only 9 cycles (roughly 70% of the original time) to compute the result, and we've kept the multiplier and the arithmetic unit busy 66% of the time, up from less than half. In fact, when we look at the inner loop in this calculation, we can see that it takes only 3 cycles, which is the theoretical limit for this multiplier since it takes 3 cycles to perform its multiplication. Our 9 cycles are made up of 2 iterations at 3 cycles each, and 3 cycles of pipeline overhead. For higher values of x, we'll approach the theoretical limit of 3 cycles per iteration. That's not bad going.

This only happens because the example was pretty well balanced: each iteration requires 3 instructions from the fetch unit, and 3 cycles of time in the multiplier. As such, it doesn't illustrate the use of register renaming or reservation stations: every cycle, the instruction issued goes directly to the appropriate execution unit, and doesn't spend any time waiting in a reservation station. So, let's modify our processor by adding a faster multiplier unit that takes only two cycles, and giving our fetch unit the ability to fetch two instructions per cycle. We'll give the arithmetic unit and the multiplier two reservation stations each, which we'll denote as 'ra0' and 'ra1', and 'rm0' and 'rm1' respectively, and we'll show these explicitly on each cycle.

As values are broadcast on the common data bus, we'll substitute those values into the code in the reservation stations to indicate that the values are now held in the reservation station ready for execution.

Cycle Registers Issue unit           Arithmetic unit               Multiply unit            Writeback
1               f=1;                 ra0:                          rm0:
                f=f*x;               ra1:                          rm1:
                                     exe:                          exe:

2    f=rm0      x=x-1;               ra0: f=1;                     rm0: f=ra0:f*3;
                if (x>=2) goto loop; ra1:                          rm1:
                                     exe: f=1;                     exe:

3    f=rm0,     f=f*x;               ra0: x=3-1;                   rm0: f=1*3;             ra0:f=1;
     x=ra0      x=x-1;               ra1: if (ra0:x>=2) goto loop; rm1:
                                     exe: x=3-1;                   exe: f=1*3; {1}

4    f=rm1,     if (x>=2) goto loop; ra0: x=2-1;                   rm0: f=1*3;             ra0:x=2;
     x=ra0      return f;            ra1: if (2>=2) goto loop;     rm1: f=rm0:f*2;
                                     exe: if (2>=2) goto loop;     exe: f=1*3; {2}

5    f=rm1,     if (x>=2) goto loop; ra0: x=2-1;                   rm0:                    rm0:f=3;
     x=ra0      return f;            ra1: goto loop;               rm1: f=3*2;
                                     exe: x=2-1;                   exe: f=3*2; {1}

6    f=rm1                           ra0: if (1>=2) goto loop;     rm0:                    ra0:x=1;
     x=ra0                           ra1: return f;                rm1: f=3*2;
                                     exe: if (1>=2) goto loop;     exe: f=3*2; {2}

7    f=rm1                           ra0:                          rm0:                    rm1:f=6;
                                     ra1: return f;                rm1:
                                     exe: return f;                exe:

Job done, in approximately 7 cycles. A few observations:

  1. There are a fair few cycles there in which the issue logic is shown as empty. In the later cycles, these could potentially be busy issuing the next few instructions.
  2. We have a few fewer cycles, but not that many fewer. Not the 1/3 reduction we'd expect to see given that we've increased the multiplier throughput by that much, and certainly not the 50% reduction that we might hope for given that we've increased the fetch and issue width by that much. We are still limited to an aggregate throughput of 1 instruction per cycle because there's still only a single common data bus.
  3. Tracing execution like this, it looks very, very complex.

The third point there is a rather important one: Tomasulo's Algorithm gives us a fairly straightforward way to describe and implement some very, very complex behaviour. But implementations can still be very complicated.

Other issues that we can probably see from here are branch prediction and exception precision. I've used an idealised branch prediction algorithm which happens to get the correct answer every time and manages to keep the issue unit fully supplied with instructions. Branch prediction becomes incredibly important as issue widths go up: every branch misprediction causes inefficiency proportional to the issue width, since that number of instructions could potentially be issued in every cycle wasted by the branch misprediction.

Exception precision is a more subtle issue. If we supplement our multiplier with a divider, and allow it to raise an exception on a divide by zero error, how would our processor be able to recover from such an exception? By the time a divide instruction is actually executed, instructions following it may have already been issued to the integer arithmetic unit or to other reservation stations in the multiplier / divider. Determining which following instructions to cancel can be difficult, and different implementations take different approaches.

The P6 core, for example, retires instructions in program order after their actual execution, holding uncommitted results in a reorder buffer until the instructions are committed. Other control schemes may keep instructions in-order in the pipeline, like a scalar machine, until they pass the point at which they can raise exceptions: division by zero, for example, can be detected early, while most memory faults can be detected before the access is attempted.
