VLIW, EPIC, and VelociTI architectures ran into trouble when compilers could not find enough instruction-level parallelism (ILP) anywhere but in tight loops. The solution: if you can't find enough ILP in one thread, run more threads.
Most programs severely underutilize a CPU's functional units, mostly because data and control hazards leave issue slots empty. Some architectures therefore maintain more than one execution state (program counter, page table, and register contents) and issue instructions from other threads whenever there's a pipeline bubble (idle functional units, branch delay, RAM latency, and so on). Surprisingly, Rollo's writeup under branch prediction has a shred of truth: you can fill a load's or branch's delay slots with instructions from another process.
Compaq designed its Alpha 21464 (EV8) processor to do this. IBM's Power4 processor does something similar called "chip multiprocessing": it puts two complete cores on one die, which is a bit simpler but means functional units are allocated to threads statically, not dynamically.
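
To make the bubble-filling idea concrete, here is a toy issue-slot model in Python. It is only a sketch under simplifying assumptions, not a description of any real processor: each thread is a hypothetical list of steps, where a step is the number of independent instructions the thread could issue at that point and 0 stands for a stall cycle (a cache miss, say). The name issue_slots_sim, the fixed thread priority, and the sample workloads are all invented for illustration.

```python
def issue_slots_sim(threads, width):
    """Toy model: share `width` issue slots per cycle among threads.

    threads: list of lists; each entry is how many independent
             instructions that thread can issue in that step
             (0 means the thread is stalled for one cycle).
    Returns (cycles, utilization), where utilization is the
    fraction of issue slots that did useful work.
    """
    pending = [list(t) for t in threads]  # copy so inputs are untouched
    cycles = 0
    issued = 0
    while any(pending):
        slots = width
        # Fixed priority: earlier threads get first pick of the slots.
        for t in pending:
            if not t:
                continue            # this thread has finished
            if t[0] == 0:
                t.pop(0)            # stall cycle elapses, nothing issued
                continue
            take = min(t[0], slots)
            t[0] -= take
            slots -= take
            issued += take
            if t[0] == 0:
                t.pop(0)            # step fully issued, move on
        cycles += 1
    return cycles, issued / (cycles * width)


if __name__ == "__main__":
    a = [2, 0, 0, 1]        # hypothetical thread with a two-cycle stall
    b = [1, 0, 2, 0, 1]     # hypothetical thread with two one-cycle stalls
    print(issue_slots_sim([a], 4))      # thread A alone
    print(issue_slots_sim([b], 4))      # thread B alone
    print(issue_slots_sim([a, b], 4))   # both threads sharing the slots
```

With these made-up workloads, running the threads one after the other takes 4 + 5 = 9 cycles, while letting them share the four issue slots takes 5, because each thread's stall cycles overlap with the other thread's useful work. Real hardware has to decide which thread gets priority and cope with contention for caches and functional units, none of which this model tries to capture.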