This writeup is about the sneaky things compilers do.

There's no general solution to the halting problem, but there are certain special cases where you can sneak up on it and give it a wedgie, at least. When you're writing a compiler, this becomes interesting, because it lets you generate code that does certain things a lot less often than it might otherwise need to. Consider the following example:

    int i = 0;

    if ( i ) {
        /* Do something improbable */
    }

Assume that nothing has been elided there. i is a variable, but when execution reaches the if, there's only one value that i can possibly have. At that point, it is for all intents and purposes¹ a constant with a known value. Therefore, you can skip compiling that conditional (thus saving a cycle or two at runtime) and everything in the block dependent on it (thus shrinking the code a bit).
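In effect, the compiler is free to pretend you wrote something like:

    int i = 0;    /* and if i is never read later, even this line goes */

Constant propagation proves the condition is always false, and dead-code elimination then discards the branch and the whole improbable block. That example is nothing much, but it's a very simple example. Consider: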

    for ( i = 0; i < 0x7fffffff; ++i ) {
        a[ i ].foo = i;
        a[ i ].bar = "baz";
        a[ i ].baz = strlen( a[ i ].bar );
        a[ i ].whatever = i & 1;
    }

A big chunk of what that loop is doing is pointer arithmetic and dereferencing pointers. A smart compiler should notice that a[i] appears five times in the body of the loop, during which nobody changes the value of i. The code it generates should add i to a once and then hang on to that. It should also notice that the argument being passed to strlen is always the same. If it knows with certainty what strlen does, it can get rid of 2,147,483,646 redundant calls to strlen — or better yet, it can generate code that never calls strlen at all: The length of the string "baz" is known to the compiler up front. The example is contrived, but the principle is valid.
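Hand-applying those two transformations, the compiler can end up with something morally equivalent to this (the struct definition and the function wrapper are my inventions, just to make the fragment self-contained):

    #include <stddef.h>

    struct thing {
        int foo;
        const char *bar;
        size_t baz;
        int whatever;
    };

    void fill( struct thing *a )
    {
        int i;

        for ( i = 0; i < 0x7fffffff; ++i ) {
            struct thing *p = &a[ i ];   /* a + i, computed once       */
            p->foo = i;
            p->bar = "baz";
            p->baz = 3;                  /* strlen( "baz" ), folded at */
                                         /* compile time               */
            p->whatever = i & 1;
        }
    }

Naturally, this all becomes more problematic in late-binding languages where "strlen" might refer to damn near anything by the time you get there. That's the "certainty" part. One of the (lesser) reasons statically-typed languages can be more efficient than their free-and-easy brethren is that there's a lot more certainty for the compiler to work with. It's also generally true that function calls, especially into libraries, make life difficult because of potential side effects and unpredictable return values.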

You might say that data flow analysis is the subtle art of recognizing the obvious when it bites you in the ass. If you can teach a compiler to do that, you're doing okay.

¹ There is no such thing as an "intensive" purpose. Those who imagine otherwise will be shot.

Data flow analysis is a method of approximating a program's runtime behaviour at compile time, usually used by the compiler to check for programming errors or to find potential optimizations. The analysis is based on the control flow graph of the program (or function); the idea is to recursively annotate each node with elements of a lattice, representing (inexact) facts about the program's execution that can be determined statically. In a forwards analysis, we define a function combining the states of the predecessors of a node, and another function to find the state at a node in terms of the program statement it represents and the combined states of its predecessors. We also need to define the state at the entry node of the graph. In a backwards analysis, we do pretty much the same thing, but in the other direction, starting from the control flow graph's exit node, and combining each node's successors' states.

Suppose, for a minute, that you're a compiler. You have a pretty tedious life, and want to do some analysis to spice things up. What kind of facts might you be interested in? Well, you might want to keep track of the possible values a variable may take; if you see that it turns out to be constant, you might want to inline its value wherever it is used. Or, you might want to perform a backwards analysis keeping track of live variables: those which may still be read before they are next updated. If you discover that a variable is not live at a particular point in the code, you need not keep memory hanging around for it.
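To make those two kinds of fact concrete, here's a tiny fragment, invented for the occasion:

int f(void)
{
    int x = 42;          /* provably 42 at every later use...        */
    int y = x + 1;       /* ...so this can be compiled as y = 43     */
    int unused = y * 2;  /* never read again, hence not live: it     */
                         /* needs no storage, and since the multiply */
                         /* has no side effects, it can go entirely  */
    return y;
}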

The combining function is written as JOIN(v), where v is a node of the control flow graph, corresponding to a statement of the program in question. The state function at a particular node v is written 〚v〛, but not all fonts have those double-square-bracket characters, so I'll write [[v]] for the purposes of this node.

JOIN(v) is going to be either the meet (∧) or join (∨) of the lattice you're working in, applied to the states of v's predecessors (or successors, in a backwards analysis; some people refer to this function as SPLIT in that case). It's usually pretty obvious which one to use. In a must analysis, a given property has to hold at every predecessor to hold at this node, so use ∧; conversely, in a may analysis a property need only hold at one predecessor to propagate to this node, and ∨ is your poison.

A straightforward example

I mentioned earlier analysing the possible values of variables, but let's consider a slightly simpler case: keeping track of which variables have definitely been initialized. This is a forwards analysis, since the fact at each node depends on the facts of the previous nodes in the program. The lattice we're working in is the power set of the set of all variables in the program; that is, the fact at each node is a subset of the program's variables. The necessary definitions pretty much write themselves. At the start of the program, no variables have been initialized:

[[ entry ]] := {}

This is a must analysis, since a variable must be initialized on all paths to a given statement to be guaranteed to be initialized at that statement. So, we need to take the intersection (the meet of this lattice) of the sets of variables known to be defined at a node's predecessors:

JOIN(v) := ⋂ { [[w]] | w ∈ pred(v) }

At an assignment to a variable, we add that variable to the set of initialized variables. So if v corresponds to a statement id = E; for some identifier id and some expression E, we have:

[[ v ]] := JOIN(v) ∪ { id }

At any other statement, we just propagate the set of initialized variables forward:

[[ v ]] := JOIN(v)
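
If you'd rather read code than equations, here are those three definitions in C. The bit-mask representation of the sets and every name here are my invention, so treat it as a sketch rather than anybody's real implementation:

/* Sets of variables as bit masks: bit i set means "variable i is
   definitely initialized".  Fine for up to 32 variables. */
typedef unsigned int varset;

/* [[ entry ]] := {} */
varset entry_state(void)
{
    return 0;
}

/* JOIN(v) := the intersection of the predecessors' states */
varset join(const varset *pred_states, int npreds)
{
    varset s = ~0u;                   /* identity for intersection */
    for (int i = 0; i < npreds; ++i)
        s &= pred_states[i];
    return s;
}

/* [[ v ]]: an assignment "id = E;" adds id to the incoming set;
   any other statement propagates it unchanged. */
varset transfer(varset in, int assigned_id)  /* -1 if no assignment */
{
    if (assigned_id >= 0)
        return in | (1u << assigned_id);
    return in;
}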

And we're done! Let's take a look at a ridiculous example program, and its beautiful ASCII-art control flow graph:

int x;
int y = 17;
int z = 0;
while (y > 0)
{
    z = z + (y * x);
    x = 3;
    y = y - 1;
}

                               +-------+
                               | entry |
                               +-------+
                                   |
                                  \|/
+-----------+  +------------+  +-------+
| int z = 0 |<-| int y = 17 |<-| int x |
+-----------+  +------------+  +-------+
  |    +--------------------------------------------+
 \|/  \|/                                           |
+--------+  +-----------------+  +-------+  +-----------+
|  y > 0 |->| z = z + (y * x) |->| x = 3 |->| y = y - 1 |
+--------+  +-----------------+  +-------+  +-----------+
    |
   \|/
+------+
| exit |
+------+

The naïve method of finding the fixed point of this analysis is to initialize every node's data to {}, then repeatedly evaluate [[ ]] at each node until nothing changes. Whatever technique is used, we converge to:

int x;                  // {}
int y = 17;             // {y}
int z = 0;              // {y, z}
while (y > 0)           // {y, z}
{
    z = z + (y * x);    // {y, z}
    x = 3;              // {x, y, z}
    y = y - 1;          // {x, y, z}
}

(Pretty much the only part of this analysis worth pointing out is that, because we're taking an intersection in JOIN, [[ y > 0 ]] does not contain x, despite x being in the set of variables certainly initialized at one of its predecessors. This is correct: we are interested in whether a variable must be initialized at each node, not whether it may be.) With these annotations, your compiler can now raise a warning that x is used uninitialized in the loop's body, and you can fix the bug!
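
For the morbidly curious, here's that naïve method as a complete C program. The node numbering, the bit-mask sets, and all of the names are inventions of mine for this writeup, not any real compiler's internals:

#include <stdio.h>

enum { X, Y, Z };                          /* bit positions          */
enum { ENTRY, DECL_X, INIT_Y, INIT_Z,      /* control flow graph     */
       COND, ASGN_Z, ASGN_X, ASGN_Y, NNODES };

typedef unsigned int varset;               /* bit i set => var i in  */

static const int preds[NNODES][3] = {      /* predecessors, -1 ends  */
    [ENTRY]  = { -1 },
    [DECL_X] = { ENTRY,  -1 },
    [INIT_Y] = { DECL_X, -1 },
    [INIT_Z] = { INIT_Y, -1 },
    [COND]   = { INIT_Z, ASGN_Y, -1 },     /* loop head: two preds   */
    [ASGN_Z] = { COND,   -1 },
    [ASGN_X] = { ASGN_Z, -1 },
    [ASGN_Y] = { ASGN_X, -1 },
};

static const int assigns[NNODES] = {       /* var assigned, or -1    */
    [ENTRY]  = -1,
    [DECL_X] = -1,                         /* "int x;" inits nothing */
    [INIT_Y] = Y, [INIT_Z] = Z,
    [COND]   = -1,
    [ASGN_Z] = Z, [ASGN_X] = X, [ASGN_Y] = Y,
};

int main(void)
{
    varset state[NNODES] = { 0 };          /* every node starts at {} */
    int changed = 1;

    while (changed) {                      /* iterate to fixed point  */
        changed = 0;
        for (int v = DECL_X; v < NNODES; ++v) {
            varset in = ~0u;               /* JOIN: intersect preds   */
            for (int i = 0; preds[v][i] != -1; ++i)
                in &= state[preds[v][i]];
            varset out = in;               /* transfer function       */
            if (assigns[v] >= 0)
                out |= 1u << assigns[v];
            if (out != state[v]) {
                state[v] = out;
                changed = 1;
            }
        }
    }

    for (int v = 0; v < NNODES; ++v)
        printf("node %d: {%s%s%s }\n", v,
               state[v] & 1u << X ? " x" : "",
               state[v] & 1u << Y ? " y" : "",
               state[v] & 1u << Z ? " z" : "");
    return 0;
}

Because an update only ever adds variables to a node's set and the sets are finite, the loop has to terminate; here it settles in three passes, printing the same sets as the annotated listing above.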
