The term superscalar refers to an instruction set processor capable of sustaining execution of instructions at a rate greater than one per clock cycle.

The simple, or 'scalar', pipelined machines that we all used up until the mid-90s or so (up to and including the Intel 486 and the Motorola 68030, plus all our ARMs, SPARCs before the SuperSPARC, and MIPS machines before the R8000) processed instructions in a manner that corresponds with the intuitive understanding of the term pipeline, with only enough room in any part of the pipeline for one instruction at a time. An adder can only add one pair of numbers at a time, after all.

As such, these machines could only approach an ideal maximum throughput, the scalar limit, of one instruction per clock cycle (each instruction spending only one cycle in each pipeline stage, with no wasted work). For short bursts of perfectly tuned code, most CPUs could easily sustain one instruction per clock cycle, but they ran out of steam as soon as they hit a branch instruction.
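To put rough numbers on the scalar limit, here is a minimal sketch of an idealised scalar pipeline. The five-stage depth and the two-cycle branch penalty are assumptions picked purely for illustration, not figures for any particular CPU, and the function names are hypothetical.

```python
def scalar_cycles(n_instructions, n_stages=5, n_branches=0, branch_penalty=2):
    """Cycles needed by an idealised scalar pipeline.

    Assumed model (not any real CPU): the first instruction needs
    n_stages cycles to drain, each later instruction completes one
    cycle after the previous one, and every branch costs
    branch_penalty bubble cycles.
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + n_branches * branch_penalty


def ipc(n_instructions, **kwargs):
    """Instructions per clock cycle for the model above."""
    cycles = scalar_cycles(n_instructions, **kwargs)
    return n_instructions / cycles if cycles else 0.0


# Straight-line code creeps up on 1.0 but never reaches it.
straight_line = ipc(1000)
# One branch in every five instructions drags throughput well below 1.0.
branchy = ipc(1000, n_branches=200)
```

Under these assumptions the straight-line case gets very close to one instruction per clock, while the branchy case falls noticeably short, which is the "ran out of steam" effect described above.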

Hence a "superscalar" processor is merely one which exceeds this scalar limit.

In modern times, the term superscalar has gained a slightly more specific definition in popular usage, since we have things which confuse the issue somewhat. Namely: SIMD, MIMD, VLIW (and EPIC, if you really must draw the distinction), and asynchronous processors.

SIMD and VLIW are instruction set architecture concepts which encode more than one operation into each instruction. The operations are largely independent, so a machine with enough functional units to execute all the elements of an instruction at the same time can easily do as much work in one clock cycle as a scalar processor would do in several, thus achieving "superscalar" performance.

At this point, it becomes important to pay attention to how an "instruction" is defined. A SIMD or VLIW instruction can encode more than one operation or function in an instruction (or instruction packet, as VLIW architectures frequently refer to their instructions), but those operations still form a single executable unit. The entire instruction is either executed or not, and the semantics of the instruction set do not allow data to move between the different operations of a VLIW instruction. So a SIMD or VLIW machine does not necessarily execute more than one "instruction" per clock, just more than one "operation".
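The instruction/operation distinction is easy to see once both are counted separately. The sketch below is only an illustration: the `Instruction` record, the `vadd4` mnemonic and the cycle count are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    name: str        # hypothetical mnemonic, for illustration only
    operations: int  # independent operations encoded in this one instruction


def per_cycle_rates(instructions, cycles):
    """Return (instructions per cycle, operations per cycle)."""
    n_instructions = len(instructions)
    n_operations = sum(instr.operations for instr in instructions)
    return n_instructions / cycles, n_operations / cycles


# Ten 4-wide SIMD adds, assumed to issue one per cycle.
stream = [Instruction("vadd4", 4)] * 10
instr_rate, op_rate = per_cycle_rates(stream, cycles=10)
```

With these made-up numbers the machine completes four operations per clock but still only one instruction per clock, so it does "superscalar" amounts of work without exceeding the scalar limit as defined in terms of instructions.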

Of course, that's not to say that SIMD or VLIW machines can't execute more than one instruction or instruction packet per clock cycle. The Pentium 4's SSE2 unit and the G5's AltiVec unit are capable of executing more than one SIMD instruction per clock cycle, and in this sense they are both superscalar and SIMD. One doesn't necessarily exclude the other.

MIMD (or SMT, or multi-core) processors may execute more than one instruction per clock cycle, but these instructions may come from entirely different sequential instruction streams, and so although aggregate throughput is more than a single instruction per clock cycle, we don't necessarily consider these as superscalar. Of course, most implementations of these technologies happen to be superscalar as well as being SMT. So we end up with a refined definition that says something like:

A superscalar processor is one capable of executing a single thread of sequential and atomic instructions at a rate greater than one per clock cycle.
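As a rough sketch of how that refined definition separates the cases, the check below looks at each thread's own rate rather than the aggregate. The function names and the retired-instruction counts are hypothetical; real measurements would come from hardware performance counters.

```python
def per_thread_ipc(retired_by_thread, cycles):
    """Instructions per cycle for each thread, measured over the same cycles."""
    return {tid: n / cycles for tid, n in retired_by_thread.items()}


def aggregate_ipc(retired_by_thread, cycles):
    """Total instructions per cycle across all threads."""
    return sum(retired_by_thread.values()) / cycles


def exceeds_scalar_limit(retired_by_thread, cycles):
    """True only if some single thread beats one instruction per cycle."""
    return any(rate > 1.0
               for rate in per_thread_ipc(retired_by_thread, cycles).values())


# Two threads sharing a core or chip, each below the scalar limit.
two_threads = {"t0": 80, "t1": 80}
# One thread on its own, above the scalar limit.
one_thread = {"t0": 150}

mimd_only = exceeds_scalar_limit(two_threads, cycles=100)
superscalar = exceeds_scalar_limit(one_thread, cycles=100)
```

In the first case the aggregate rate is above one instruction per clock, yet no single thread exceeds the scalar limit, so by the definition above the behaviour is not superscalar. In the second case a single sequential stream exceeds it on its own.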

Now that we've got that all sorted out and clear in our minds, asynchronous processors come along and blow us clean out of the water by not even having a clock in the first place, thus executing a completely undefined number of instructions per clock cycle.

It's a funny old world.

A superscalar processor is one that accepts more than one instruction into the pipeline per clock cycle.

The Pentium 4 processor accepts up to 6 instructions per clock cycle, while emitting the results of 3 instructions per clock cycle. Why it emits fewer than it gets per clock and still performs well is left as an exercise for the reader.

This is different from a VLIW processor. Very Long Instruction Word processors accept multiple instructions per cycle, but they must be formed into groups, and certain 'slots' of the very long instruction can only perform certain operations, so a series of add instructions may only use two of, say, the four execution slots available per long instruction. This processor design relies on compiler intelligence to parallelize code, whereas a superscalar processor parallelizes the code dynamically, requiring no special compilation.
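The slot restriction can be sketched as a toy packer of the kind a VLIW compiler might run. Everything here is an assumption for illustration: the four-slot layout (two ALU slots, one memory slot, one branch slot), the greedy in-order strategy, and the premise that all operations are independent of each other.

```python
# Hypothetical bundle layout: which kind of operation each slot accepts.
SLOT_KINDS = ("alu", "alu", "mem", "branch")


def _try_place(bundle, op_kind, slot_kinds):
    """Put op_kind into the first free slot of a matching kind, if any."""
    for i, kind in enumerate(slot_kinds):
        if kind == op_kind and bundle[i] is None:
            bundle[i] = op_kind
            return True
    return False


def pack_bundles(op_kinds, slot_kinds=SLOT_KINDS):
    """Greedily pack independent operations, in order, into long instructions.

    An empty slot is recorded as None (in effect, a no-op).
    """
    unknown = [k for k in op_kinds if k not in slot_kinds]
    if unknown:
        raise ValueError(f"no slot accepts: {unknown}")
    bundles = []
    current = [None] * len(slot_kinds)
    for op_kind in op_kinds:
        if not _try_place(current, op_kind, slot_kinds):
            # No free slot of the right kind: close this bundle, start another.
            bundles.append(current)
            current = [None] * len(slot_kinds)
            _try_place(current, op_kind, slot_kinds)
    if any(slot is not None for slot in current):
        bundles.append(current)
    return bundles


# Six independent adds need three bundles, each with two slots left empty.
adds_only = pack_bundles(["alu"] * 6)
# A mix that matches the slot layout fits in a single bundle.
mixed = pack_bundles(["alu", "alu", "mem", "branch"])
```

With this assumed layout, a run of adds fills only half of every long instruction, while a well-mixed sequence fills one completely. A real VLIW compiler also has to respect data dependencies and latencies, which this sketch ignores.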
