Most modern CPUs include some sort of SIMD instruction set; MMX on x86 machines since the later Pentiums, SSE on Pentium III and Athlon XP, VIS on SPARC, and AltiVec on PowerPC. But in general it is very hard to use these instructions; a programmer must usually resort to writing assembly code.

Most algorithms start out their lives as serial code. That is to say, they do one instruction, then another, and then yet another, and so on. Someone, be it the compiler or the programmer, has to convert the code into a form suitable for using parallel instructions, wherein more than once piece of data is handled by a single instruction. For some ideas on this, read the MMX Optimization Tutorial, which gives a basic overview of the process.

Vectorization only really works on code that is compiled natively; things executed inside an interpreter are unlikely to benefit, unless the interpreter is very smart. Languages like C, C++, and Fortran are most likely to benefit, and historically most of the work done in vectorization, especially automatic vectorization, has been on these languages.

Here's a simple example that shows you how vectorization works, and why it's useful.

for(unsigned int j = 0; j != length; j++)
     z[j] = x[j] ^ y[j];

In this example, x and y are both arrays of bytes, and both have length elements. A (wildy oversimplified) analysis of this code would suggest it would take about 4*length instructions, ignoring loop overhead (which would actually be fairly high in this case, but we'll ignore it for simplicity). The multiplication by 4 is for "load, load, xor, store" for each time through the loop. If x, y, or z overlap each other in any way, then that's about as best we can do. However, if the compiler can know that this is not the case (either through analysis of the code, or by the programmer telling it somehow, like with C's restrict), then it can use vectorization instead. This is actually one of the reasons that Fortran is often faster than C/C++ at numerical code - the only time two arrays can overlap, or share the same address, is if you explicitly tell the compiler that that is the case. On the other hand, with C/C++, the only time the compiler can assume this is true is if you tell it so..

Now consider using MMX instructions. We can do 8 bytes at once, executing MOVQ instructions to load 8 bytes of x and y in one shot. We proceed with a PXOR instruction, then finally do another MOVD to put the result in the correct spot in z. Even if the MMX instructions are somewhat slower, we still get major gains.

I really don't know what the speed of MMX is, but I'd guess on recent machines, most operations are only a few cycles. In particular, I'd be willing to bet that PXOR is a one-cycle instruction, and that MOVD is not more than twice as slow as moving a single byte to/from the L1 cache. We do 3 MOVD and a single PXOR per loop; so maybe 3*2+1=7 cycles per 8 bytes. That's a lot cheaper than the 4 cycles per byte of the first loop.

The Intel C compiler does the above conversion automatically. Currently, floating point data is much better supported by automatic vectorizers, but lately I have been talking to an engineer at Intel to provide some feedback about integer vectorization. I can't remember for sure, but I think someone might be working to add such code to GNU GCC as well.

Keep in mind that my timings are totally made up (though they should be reasonably accurate reflections of real world timing), and are just for example purposes anyway. I also totally ignore the effects of a superscalar processor, because that would be really hard, and anyway depend intimately on the exact CPU being used.

Hope this has been informative.

†: Thanks to ariels for reminding me of this restriction.