"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." -- Donald Knuth

"The first principle of optimization is don't." -- Brian Kernighan/Rob Pike, The Practice of Programming

When talking about computer programming, optimization refers to improving a program. This is actually a misnomer, because generating an optimal solution is far too complex a problem. Instead, we have to settle for a solution that is better than the old one. In terms of code generation, better means code that executes faster, although there are cases where code that occupies less space is more valuable (e.g. embedded platforms with limited memory available).

Optimization occurs at two levels. The programmer can manually optimize a program, or a compiler can automatically optimize it. As Knuth, Kernighan, and Pike suggest above, though, manual optimization is best left until a performance problem actually occurs; after all, there is no sense in spending days making your program run seconds faster if nobody is waiting for it. Making code faster also tends to make it more complicated, so premature optimization is likely to leave you with buggier, less maintainable code.

When it is necessary to make a program faster, the best results usually come from the programmer finding the key performance bottlenecks and fixing those areas. The programmer should first locate those bottlenecks: with simple programs it is often quite obvious where to start, while larger programs may require a profiler. Once the key areas are identified, here are some techniques programmers can use:

  • Faster algorithms The classic examples of this are sorting and searching. Replacing a bubble sort, or any O(n^2) sort for that matter, with a speedy O(n log n) quicksort or a specialized radix sort will make a big difference if many items are being sorted. Likewise, binary searches, O(log n), leave linear searches, O(n), in the dust. (A short sketch of this follows the list.)
  • Lookup tables This technique precalculates time-consuming values and stores them ahead of time in a table that can be consulted later. An example: if the sine and cosine of angles are calculated repeatedly in a loop, but the angles are always rounded to the nearest degree, you could precalculate the sine and cosine for every angle up to 360 degrees (a sketch follows the list). A variation of this technique skips the precalculation and instead caches values as they are first used.
  • Better data structures This is related to faster algorithms, as data structures and the algorithms used with them are often two sides of the same coin. Examples here are replacing lists or arrays with hash tables or trees to enable faster lookups, as in the hash table sketch after this list.
  • Specialized instructions Intel introduced the MMX instructions to speed up cases where the same arithmetic operation is performed over and over on large quantities of data. A programmer can often speed up certain types of operations simply by using the specialized instructions (see the sketch after this list). Unfortunately, compilers are not very good at detecting the situations where these instructions would be useful, so it is up to the programmer to figure it out.
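
As a rough sketch of the algorithm replacement described in the first bullet, the following compares a linear search, O(n), with a binary search, O(log n), over a sorted array. The function names and data are made up purely for illustration.

    #include <stdio.h>
    
    /* Linear search: O(n) comparisons in the worst case. */
    int linear_search(const int *a, int n, int key)
    {
        for (int i = 0; i < n; i++)
            if (a[i] == key)
                return i;
        return -1;
    }
    
    /* Binary search: O(log n) comparisons, but the array must be sorted. */
    int binary_search(const int *a, int n, int key)
    {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (a[mid] == key)
                return mid;
            else if (a[mid] < key)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return -1;
    }
    
    int main(void)
    {
        int sorted[] = { 2, 3, 5, 7, 11, 13, 17, 19 };
        int n = sizeof(sorted) / sizeof(sorted[0]);
        printf("%d %d\n", linear_search(sorted, n, 13), binary_search(sorted, n, 13));
        return 0;
    }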
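
The lookup table technique might look something like this in C; the 360-entry table and the function names are illustrative, not taken from any particular program.

    #include <math.h>
    #include <stdio.h>
    
    #define TABLE_SIZE 360
    #define PI 3.14159265358979323846
    
    static double sine_table[TABLE_SIZE];
    
    /* Precalculate sin() for every whole degree once, up front. */
    static void init_sine_table(void)
    {
        for (int deg = 0; deg < TABLE_SIZE; deg++)
            sine_table[deg] = sin(deg * PI / 180.0);
    }
    
    /* Later lookups are a single array access instead of a call to sin(). */
    static double fast_sin(int deg)
    {
        return sine_table[((deg % TABLE_SIZE) + TABLE_SIZE) % TABLE_SIZE];
    }
    
    int main(void)
    {
        init_sine_table();
        printf("sin(30 degrees) is roughly %f\n", fast_sin(30));
        return 0;
    }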
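
For the data structure bullet, here is a minimal hash table sketch using the POSIX hcreate()/hsearch() functions (not part of standard C and not available on every platform); the keys and values are invented for illustration.

    #include <search.h>   /* POSIX hcreate()/hsearch() */
    #include <stdio.h>
    
    int main(void)
    {
        static char *extensions[] = { "jpeg", "png", "gif" };
        static char *types[]      = { "image/jpeg", "image/png", "image/gif" };
    
        /* Build the table once... */
        hcreate(16);
        for (int i = 0; i < 3; i++) {
            ENTRY e = { extensions[i], types[i] };
            hsearch(e, ENTER);
        }
    
        /* ...then each lookup is roughly O(1) instead of a linear scan. */
        ENTRY query = { "png", NULL };
        ENTRY *hit = hsearch(query, FIND);
        if (hit != NULL)
            printf("%s -> %s\n", hit->key, (char *)hit->data);
        return 0;
    }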
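
Finally, for the specialized instructions bullet, here is a sketch using Intel's SSE intrinsics, a later relative of the MMX instructions mentioned above that works on floating point data. It assumes an x86 target and a compiler that provides <xmmintrin.h>, such as gcc or Visual C++.

    #include <xmmintrin.h>   /* SSE intrinsics; x86-specific */
    #include <stdio.h>
    
    int main(void)
    {
        float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
        float result[4];
    
        /* Load four floats from each array, add all four pairs with a single
           instruction, then store the four sums back to memory. */
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        __m128 vsum = _mm_add_ps(va, vb);
        _mm_storeu_ps(result, vsum);
    
        printf("%f %f %f %f\n", result[0], result[1], result[2], result[3]);
        return 0;
    }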

While it might be a waste of time to hand optimize code for smaller gains, compilers can automatically do some simple optimizations. This is usually worthwhile, since the main trade-off is typically a slower compile. Here are some examples of common compiler optimizations:

  • Constant folding This is one of the simplest forms of optimization. It finds expressions that can be calculated at compile time, usually because all the operands are themselves constant, and puts the constant result into the program rather than the code to generate it. An example:
    int len = strlen("some constant string");
    In this case, the compiler knows the string is 20 characters long and it knows what strlen does, so it can just assign 20 to len without calling strlen. Of course, it can only do this for functions that have no side effects and that always return the same value at compile time and at runtime.
  • Instruction ordering Modern processors can execute more than one instruction per clock cycle, but usually only if the instructions use different parts of the CPU. For example, a multiplication might be able to execute at the same time as an address calculation, but two multiplications would have to be done one after the other. Code can be optimized by rearranging instructions so that instructions using different parts of the CPU sit close together. Of course, instructions that depend on each other's results can only be rearranged so much.

    Another aspect of instruction ordering is that of branch timings. Most CPUs execute code faster if a branch is not taken. Thus, it makes sense to make the most common case fall through the branch and have the exceptional cases take the branch (see the branch hint sketch after this list). Some more modern compilers can take runtime profile data as input and use it to lay out branches accordingly. In particular, just-in-time (JIT) compilers like Sun's HotSpot are ideally suited to this, as they have complete access to runtime profile data.
  • Register allocation A CPU will generally have a small amount of very fast internal memory called the register set. Most operations are faster when performed on registers than when performed directly on memory, and there are usually certain operations that can only be performed on registers. Loading important variables into registers can have a huge effect on performance, but since register space is very limited, trade-offs are necessary. This is especially true on architectures with small register sets, such as the Intel x86. (A sketch after this list shows a loop that is easy, and one that is hard, for the compiler to keep in registers.)
  • Aligning data Another quirk of most CPUs is that they access aligned data faster. For example, a variable with an even address might be accessed faster than one with an odd address. Most often this is related to the word size of the processor: if the CPU has a 32-bit data bus, it can load a whole 32-bit variable at once only if the address is a multiple of four. Some CPUs even require aligned access and throw exceptions if asked to look at unaligned data. Since few languages allow explicit alignment of data, the compiler usually has complete control over data alignment (see the struct padding sketch after this list).
  • Strength reduction This is the process of taking a complicated operation and replacing it with a less complicated one. A common example is replacing a multiplication with an addition. Most computers can do an addition faster than a multiplication, so the addition is said to be less strong. The strength reduction that compilers can do automatically is usually fairly simple. The following example shows a strength reduction where a multiplication in a loop is replaced by an addition, by introducing a temporary variable to store intermediate results:
    unoptimized version:
    x = someval();
    for (i = 0; i < 100; i++) {
        foo[i] = i * x;
    }
    
    optimized version:
    x = someval();
    tmp = 0;
    for (i = 0; i < 100; i++) {
        foo[i] = tmp;
        tmp += x;
    }
    
  • Loop unrolling The typical way to execute a statement a given number of times is to enclose it in a for loop instead of simply repeating the statement. While this may be more understandable, flexible, and readable, it has a few disadvantages, too. On each iteration, a counter variable must be accessed and tested against the termination condition. In addition, the computer must perform a branch: when it reaches the bottom of the loop, it has to stop running instructions in a nice sequence and go back to the top of the loop. Branches are one of the worst offenders for slowing down CPUs; on older processors, the instructions that were pre-fetched from memory had to be thrown away, while newer processors suffer stalled pipelines and cache misses.

    Loop unrolling simply gets rid of the loop construct and repeats the statement. Compilers are pretty good at unrolling loops. Here is an example where the loop is partially unrolled; every fourth iteration, it must check timeToExit(). The unoptimized version evaluates (i % 4) and branches back to the top once for every time through the loop. The unrolled version avoids (i % 4) altogether and only branches back once for every fourth time through the loop.
    unoptimized version:
    for (i = 0; i < 1000; i++) {
        if ((i % 4) == 0 && timeToExit())
            break;
        doSomething();
    }
    
    optimized version:
    for (i = 0; i < 1000; i+=4) {
        if (timeToExit())
            break;
        doSomething();
        doSomething();
        doSomething();
        doSomething();
    }
    
  • Inlining functions Just as there is overhead with loops that can be optimized away by unrolling, functions have overhead that can be optimized away by inlining. Function inlining is where the call to a function is replaced by a copy of the function body itself; it saves the cost of setting up the stack frame, the function call itself, the destruction of the stack frame, and the return from the function. It can also enable further optimization, like constant folding and better instruction ordering (see the sketch after this list).

    Inlining is usually very much a speed versus size trade-off, because each time a function is inlined, another copy of its body is added to the final program. Exceptions to this are very small functions, where the body takes up less space than the calling code, and functions that are called only once, where the inlined copy is the only one needed.
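
To make the branch ordering point under "Instruction ordering" concrete, gcc and clang let the programmer hint which way a branch usually goes via the non-standard __builtin_expect extension; the likely/unlikely macro names below are a common convention, not part of the language.

    #include <stdio.h>
    
    /* gcc/clang extension: tell the compiler which way a branch usually goes
       so it can make the common case the fall-through path. Not standard C. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)
    
    static int process(const int *item)
    {
        if (unlikely(item == NULL)) {     /* rare error path: expected not taken */
            fprintf(stderr, "null item\n");
            return -1;
        }
        return *item * 2;                 /* common case: falls straight through */
    }
    
    int main(void)
    {
        int x = 21;
        printf("%d\n", process(&x));
        return 0;
    }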
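
For the register allocation bullet, here is a hypothetical sketch (the function names are made up) of code that is hard versus easy for the compiler to keep in registers. In the first version the running total lives in memory that might alias the input array, so the compiler generally has to write it back to memory on every iteration; in the second, the total is a local variable that can live in a register for the whole loop and be stored just once at the end.

    #include <stdio.h>
    
    /* Hard to register-allocate: *total might alias data[], so the running
       total must be stored back to memory on every iteration. */
    void sum_into(const int *data, int n, int *total)
    {
        for (int i = 0; i < n; i++)
            *total += data[i];
    }
    
    /* Easy to register-allocate: the running total is a local variable, so it
       can stay in a register for the whole loop. */
    void sum_into_local(const int *data, int n, int *total)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += data[i];
        *total += sum;
    }
    
    int main(void)
    {
        int data[] = { 1, 2, 3, 4, 5 };
        int a = 0, b = 0;
        sum_into(data, 5, &a);
        sum_into_local(data, 5, &b);
        printf("%d %d\n", a, b);
        return 0;
    }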
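
For the data alignment bullet, this small program shows the padding a typical compiler inserts so that struct members stay aligned; the exact numbers printed depend on the platform.

    #include <stdio.h>
    #include <stddef.h>
    
    /* The members only need 5 bytes, but the compiler normally pads the struct
       so that i starts on a 4-byte boundary; on a typical 32- or 64-bit machine
       this prints an offset of 4 and a size of 8. */
    struct sample {
        char c;   /* 1 byte */
        int  i;   /* usually 4 bytes, kept at a 4-byte-aligned address */
    };
    
    int main(void)
    {
        printf("offset of i:   %zu\n", offsetof(struct sample, i));
        printf("sizeof struct: %zu\n", sizeof(struct sample));
        return 0;
    }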
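
And for the inlining bullet, here is a tiny, invented function that is a good candidate: once the call is replaced by the body, constant folding can reduce the whole expression to the constant 49. Most compilers will do this with or without the inline keyword when optimization is turned on.

    #include <stdio.h>
    
    /* A good inlining candidate: the body is smaller than the code needed to
       set up and make the call. */
    static inline int square(int x)
    {
        return x * x;
    }
    
    int main(void)
    {
        /* After inlining, this is as if 7 * 7 were written here directly, and
           constant folding can then reduce it to the constant 49. */
        printf("%d\n", square(7));
        return 0;
    }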