VLIW, or Very Long Instruction Word, is a relatively new form of computer architecture based on explicitly parallel operations. It continues the trend of the last 20 years, from CISC architectures like VAX and x86 to RISC architectures such as SPARC and MIPS, of taking more and more of the optimization load off the programmer and the CPU and placing it on the compiler, which acts as an intermediary between the programmer and the system. For example, CISC ISAs often have instructions which perform quite complex operations (string handling, complex mathematical functions, and so on), while RISC CPUs only provide a 'bare bones' interface, well suited to compilers (especially compilers for languages like C), but less so to assembly programmers.

VLIW takes this trend a step further by 'bundling' several operations into a single instruction word. This is where the very long instruction words come from: each instruction actually performs multiple operations. The trick is that each operation is independent of the others in the bundle. Most VLIW systems have instruction words that are 128 or 256 bits long, compared to the 32-bit instructions of most RISC systems (including pure 64-bit machines like the Alpha). Since the operations in a bundle are independent of one another, the CPU can execute all of them without worrying about one stepping on another's toes by reading or writing data that a neighbour is also touching. All super-scalar CPUs have to deal with this problem, and they generally devote a large number of transistors to identifying conflicting instructions and executing them in the correct order. For performance reasons, they also have to be able to delay an instruction and continue running other (non-conflicting) instructions while it waits.
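To make the bundling concrete, here is a rough sketch of what a single VLIW instruction word might look like. The four-slot layout and the mnemonics are invented for illustration (real formats differ in their details), in the same pseudo-assembly style as the example further down:

  ; one long instruction word holding four independent operations,
  ; one per functional unit (hypothetical layout and mnemonics)
  { ADD r1, r2, r3 | MUL r4, r5, r6 | LD r7, [r8] | NOP }
  ;  integer unit     multiply unit    load/store    unused slot

The CPU fetches and issues the whole word at once. Because the compiler has already guaranteed that no slot touches a register another slot in the same word is writing, the hardware needs none of the conflict-checking circuitry just described.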

As an example of that conflict problem, imagine a conventional CPU with two pipelines: one executes multiply instructions, and the other handles all other operations. Given a sequence of instructions like this (pseudo-assembly):

  MUL r3, r1, r2    ; multiply r1 and r2, put result into r3
  ADD r4, r3, r4    ; add r3 to what's in r4
  ; a bunch of instructions which don't touch r3 follow

the CPU not only has to identify that the addition can't run until the multiply is finished, but should ideally keep executing more of the instruction stream in the meantime, going back to the addition once the multiply completes. This matters because multiply instructions often take a significant amount of time, on some machines as much as 100 cycles. That may not sound like much when modern CPUs execute a billion or more cycles every second, but having the entire system stall for 100 cycles over a simple operation like this is a huge waste of resources. More importantly, it removes the benefit of having a super-scalar architecture in the first place, since in this simple, quite common case the system is reduced to using a single pipeline at a time. To prevent exactly this problem, CPUs carry a large amount of hardware for out of order execution, which lets them keep working through cases like the one above. This, of course, comes at a cost, namely more transistors (and thus more heat and power usage). The example is somewhat contrived (though the original Pentium did in fact have an architecture quite similar to this example CPU's), but it gives a reasonably accurate picture of the problem being solved.
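To see what all that out of order hardware actually buys, here is roughly how such a CPU might treat the sequence above. The extra instructions are hypothetical, in the same pseudo-assembly as before:

  MUL r3, r1, r2    ; issued to the multiply pipeline; takes many cycles
  ADD r4, r3, r4    ; needs r3, so it waits in a queue
  SUB r6, r5, r6    ; doesn't touch r3; runs while the multiply is in flight
  LD  r8, [r9]      ; also independent; also runs early
  ; once the multiply writes r3, the queued ADD finally executes

All of the bookkeeping (tracking which instructions wait on which registers, which are safe to run early, and the order in which results must become visible) is done by dedicated circuitry, and that circuitry is exactly what VLIW sets out to eliminate.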

The basic idea of VLIW is to take what is currently done in hardware (the out of order execution unit) and put it into software (the compiler). Not only should the compiler do a better job of finding and working around dependencies, since it has the source code available to it, but doing the work in software is faster and cheaper. Faster, because the scheduling is done once, at compile time, rather than every time the program runs; cheaper, because the CPU designer no longer has to include an out of order execution unit in every chip. The saved die area can go towards larger caches or more ALUs, or the savings in transistor count, power, and heat can be spent putting more powerful CPUs into smaller devices.
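Continuing the hypothetical notation from above, a VLIW compiler performs that reordering itself and emits it directly as packed instruction words, so the multiply/add sequence might come out something like this:

  { MUL r3, r1, r2 | SUB r6, r5, r6 | LD r8, [r9] }   ; three independent ops in one word
  { NOP            | NOP            | NOP         }   ; filler while the multiply completes
  { ADD r4, r3, r4 | NOP            | NOP         }   ; r3 is ready, so the add can issue

Note the NOPs: when the compiler can't find enough independent work to fill every slot, the empty slots still take up space in the program, which is one reason the quality of the scheduling matters so much on these machines.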

While VLIW allows, in theory, for extremely high efficiency, there are some downsides. Possibly the biggest is that it requires the compiler to be very intelligent, far more so than the compilers that CISC and RISC designs needed. Since the whole design of VLIW assumes the compiler can handle all of the instruction scheduling, if the compiler is not up to the task (and most aren't), the performance of the chip suffers badly. Another is that it is extremely difficult to hand-code assembly on these systems, since they are built around executing compiled code. While most people don't write assembly these days, it is still necessary for maximum performance, especially since most compilers, even today, don't handle VLIW very well. GCC is somewhat notorious for this, having originally been designed for machines like the 68k and VAX; the resulting impact on the performance of Linux and BSD systems on VLIW machines like the IA-64 is quite substantial.

Currently available VLIW chips include the Philips TriMedia, used in set-top boxes, the Intel/HP Itanium, and Transmeta's Crusoe. Intel refers to the IA-64 as EPIC, for Explicitly Parallel Instruction Computing, rather than VLIW, but the basic concept is the same. I would welcome information about other VLIW systems available today, as those are the only ones I know of (and Crusoe is only VLIW internally; it presents an x86 ISA to the outside world, so it doesn't really count).
