A preprocessor is a tool that processes input in order to prepare it for another, given processor.

In the context of software systems, preprocessing is a very common technique. It is used to extend the power or range of applications of an existing data processing system - without having to modify that system.

This is a big deal. Software is abstract and complex stuff. Usually, it is easy enough to describe what individual functions or features are supposed to do; the challenge is to have a system in which it is still clear how exactly all of its functions and features work in arbitrary combination. The number of combinations is of course much bigger than the number of features (combinatorial explosion), which doesn't hurt if they are independent, but often there are all kinds of subtle dependencies between them. This can make it extremely hard to understand how the system works in full detail.

Not ending up with a mess is important to the users, but also to the software's developers. If they ever want to change the system without introducing bugs (malfunctions), they must have a very clear picture of how the system works. The only way to do this is to make the system consist of smaller subsystems that each have a clear purpose and design, and combine them in a way that is itself a clear design. But keeping systems well-organized as they change is difficult, and it only pays off in the long run, which is why the world is full of bloatware: software that combines heaps of features, but whose internals have degenerated into a steaming pile of spaghetti code.

Another important issue for software is compatibility. Changes to how software works come at a cost to users, who now have to relearn how it is done. What is worse, the old versions will continue to be around, so even new users, who will inevitably Google up documentation for old versions, will have to learn about the differences between versions. So when developers go in and modify software they should still make it seem as if nothing has changed, until users want to use its new features. This is a fairly crippling requirement.

In short, when you are a developer of software people know well and are happy with, but you need additional capabilities, you only want to modify that software as the very last resort. If at all possible, you want to leave it untouched, and write a new piece of software that cooperates with it. A preprocessor is an example of such software.

One area in which preprocessors are very popular is in the area of programming languages. A programming language is a notation to describe in human-readable form what a piece of software should do, such that the resulting descriptions (the source code) can be automatically translated into executable programs. A compiler is a program that performs this automatic translation.

Since programming languages exist for the purpose of convenience, they tend to have a lot of redundancy: the same thing can be written in many different ways. For instance, in the C programming language, to increase the value of a variable i by one, we can write ++i; but (for modern machines or modern C compilers) this is exactly the same as writing i += 1, which in turn is just a shorthand for i = i + 1. So the notations ++ and += are examples of syntactic sugar: we can define their meaning in terms of other C constructs. A smart developer of compilers will take advantage of this by creating one or more preprocessors that actually eliminate syntactic sugar by replacing it with equivalent notations. This way, the actual translator can be much simpler.
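The equivalence can be seen directly in code. Here is a minimal sketch (the function name is made up for illustration) showing the three notations side by side; a desugaring step could rewrite the first two into the last without changing the program's behavior:

```c
/* Three equivalent ways to add one to i. A compiler (or a desugaring
   preprocessor pass) can reduce the first two forms to the last one. */
int increment_three_ways(int start)
{
    int i = start;
    ++i;        /* sugar for i += 1 */
    i += 1;     /* sugar for i = i + 1 */
    i = i + 1;  /* the fully desugared form */
    return i;   /* start + 3 */
}
```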

Redundancies such as ++ and += were present in C from the start, but C is still among the simplest of programming languages. In particular, it didn't originally have much support for large-scale software construction.

A C program consists of a set of function and variable definitions. Each definition starts with the declaration, which states the name of the variable or function and the type of its value, and for a function, the type of each argument. For instance, a function to compute the number of visitors to a website on a given day might be declared like this:


  int nr_of_visitors(website w, day d)

Declarations can appear in isolation, but there must be one actual definition. For a variable, its definition can assign a value to it. For a function, the definition provides the function body: C statements that perform the desired computations, usually by calling lots of other functions. A simple C program contains thousands of function calls. And developers easily make mistakes when writing them: they put the arguments in the wrong order, or they provide arguments of the wrong type. Sometimes definitions are changed, e.g. an extra argument is added, and all calls to the function become invalid. It is essential to have a tool that automatically cross-checks the consistency of all function calls and uses of variables with their declarations. This is called type checking and every C compiler does it as a first step.
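The distinction between declaration and definition, and the kind of mistake type checking catches, can be sketched like this (the function and its purpose are invented for illustration):

```c
/* A declaration states the name and the types; the definition
   supplies the body. The declaration may be repeated many times,
   but there must be exactly one definition. */
int area(int width, int height);      /* declaration */

int area(int width, int height)       /* the one definition */
{
    return width * height;
}

/* A call such as area("wide", 3) puts an argument of the wrong type
   in the first position; the compiler's type checker rejects it by
   comparing the call against the declaration above. */
```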

But developers rarely write their programs from scratch; as much as possible they reuse existing libraries, usually written by other developers. The source code for these libraries isn't usually available, but we still want to perform type checking. In the case of C, this means that all variable and function declarations for a library must be available, even when the rest of the source code is not. But the C compiler only works on a single file of source code at a time.

Following the principle stated above of never changing a working program if it can be avoided, this additional capability was provided by creating a preprocessor. Among C software developers, which the previous authors in this node clearly are, it is known simply as the preprocessor.

The magic in this design is that the preprocessor is an extremely stupid tool. It doesn't even know what a declaration is; it doesn't know anything about C at all. This is by design: make stupid tools that only try to do a simple job, and you will not only understand the tool better, you will also be able to reuse it in other systems that also need to have that job done. RPGeek informs me that it works very well with FORTRAN instead of C.

So what does the C preprocessor do? It implements a mechanism for combining text files into larger text files. How to combine them is specified by inserting special keywords into the original files:

  • #include "somefile" is replaced by the preprocessor with the literal contents of the file somefile
  • #define foo bar causes the preprocessor to literally replace every further occurrence of the string foo with the string bar
  • #ifdef foo causes the preprocessor to skip the following text if there is no previous #define foo something
  • #if foo > 2 causes the preprocessor to skip the following text unless foo has been defined as a value greater than 2 (an undefined name counts as 0); <, >= and == can be used as well
  • #endif cancels the skipping for the previous #if or #ifdef
That is basically it. (There is more syntax but it isn't relevant now.)
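These directives compose in the obvious way. A small sketch, using a made-up macro name, of how they interact (note that the preprocessor performs pure text replacement; it never evaluates any C):

```c
#define BUILD_LEVEL 3    /* every later BUILD_LEVEL becomes the text 3 */

#ifdef BUILD_LEVEL       /* kept: BUILD_LEVEL was defined above       */
#if BUILD_LEVEL > 2      /* kept: 3 > 2                               */
int build_level(void) { return BUILD_LEVEL; }  /* becomes: return 3; */
#endif
#endif
```

Had BUILD_LEVEL been defined as 1, everything between #if and the first #endif would have been silently dropped from the text before the compiler ever saw it.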

How can this be used to publish declarations in C programs?

Well, it is up to the author. For every file of C code, the author must provide a separate file that consists of nothing but the declarations of all the variables and functions defined. These files, known as header files, are published with the software. All other source code that employs the variables and functions can then #include the header files that declare them; the C preprocessor combines them into a single file, allowing the C compiler to perform its type checking job. The C compiler never needs to know that the type checking information was actually assembled from completely different files, usually written by different authors; and the preprocessor doesn't even have to know it is dealing with C. Best of all, the assembly of type checking information isn't a mysterious process hidden within the guts of a tool, but is completely transparent to the software developer who depends on it.
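To make this concrete, here is a sketch of what the compiler sees after the preprocessor has done its pasting. The file names and functions are hypothetical; in reality the first part would live in a header file pulled in by a #include "stats.h" line:

```c
/* --- as if from "stats.h", pasted in by #include "stats.h" --- */
int nr_of_visitors(int website_id, int day);

/* --- the program that uses the library --- */
int visitors_this_week(int website_id)
{
    int total = 0;
    for (int day = 0; day < 7; day++)
        total += nr_of_visitors(website_id, day);  /* checked against the
                                                      declaration above */
    return total;
}

/* --- a stub standing in for the library's own definition --- */
int nr_of_visitors(int website_id, int day)
{
    (void)website_id;
    return day + 1;   /* dummy data: day 0 had 1 visitor, day 1 had 2, ... */
}
```

The type checker validates the call in visitors_this_week purely against the pasted-in declaration; the library's actual definition may live in a compiled file it never sees.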

A great design: simple and transparent. A prime example of the philosophy of building systems by combining tools that are each as stupid as possible. But there are some drawbacks.

One issue is that keeping header files consistent with the code is up to the author. Yet another preprocessor-like tool, makedepend, was developed to help with this. It knows just enough about the preprocessor's syntax to trace all the #include directives in C code, so that the build system can automatically recompile every file that depends on a header whenever that header changes.

Another issue is inherent in the design. Every #define is potentially disastrous, since it can inadvertently or intentionally redefine the names of variables and functions. When this happens, the preprocessor will never warn you about it, since it does not know what a name is; but neither will the C type checker, since it doesn't even know that any preprocessing happened. makedepend might also get confused.
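A sketch of how this goes wrong, with invented names. Note that a function-like macro only rewrites occurrences of the name that are followed by parentheses, which is exactly what makes the rewrite invisible at the call site:

```c
/* A perfectly ordinary function... */
int max_users(void) { return 100; }

/* ...clobbered by a #define, perhaps buried in some included header. */
#define max_users() 9999

int current_limit(void)
{
    return max_users();  /* looks like a call to the function above,
                            but the preprocessor has already rewritten
                            it to the literal text 9999 */
}
```

Neither tool complains: the preprocessor doesn't know max_users was a function, and the type checker never sees the original text.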

That being said, the preprocessor has been a life saver for C. The #if feature even allows programs to work across incompatible versions of libraries. C has spread to many different platforms, and its libraries have continued to develop on most of them, becoming incompatible in the process. Consequently, programs that have to run on a variety of platforms are invariably littered with hundreds of #if clauses to work around the differences between the libraries available on those platforms. A whole suite of preprocessors, autoconf and friends, has been developed to automatically generate such clauses and set the right values for the #ifs as much as possible.
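A tiny sketch of the kind of platform juggling involved, using the predefined _WIN32 macro that Windows compilers conventionally set (autoconf-generated configuration macros work the same way, just on a much larger scale):

```c
/* Pick the path separator for the platform at preprocessing time;
   only one of the two branches ever reaches the compiler. */
#ifdef _WIN32
#define PATH_SEPARATOR '\\'
#else
#define PATH_SEPARATOR '/'
#endif

char path_separator(void) { return PATH_SEPARATOR; }
```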

The C preprocessor executable is called cpp, the C compiler cc, and the linker, which assembles the results of compilation into a single program, is called ld. The GNU C compiler, gcc, combines these steps into a single program, but still allows them to be executed separately.

Another popular macro processor, more powerful than cpp but built on the same principle of textual substitution, is m4. It is a standard component of the "autoconf tools" mentioned above.

The story of C preprocessing doesn't end here.

What cpp does is clever, but it is only the bare minimum required to support code reuse. A natural extension of that technique, known as object orientation, designed in Norway before C existed, kept growing in popularity; eventually, a Dane working in the birthplace of C decided to add object orientation to C. The result is known as C++. Naturally, C++ was originally implemented as a preprocessor that converted all object oriented constructs into standard C source code. (This is no longer true today.)

Later on, an equally fundamental enhancement to C++, the STL, was built on templates: a compile-time substitution mechanism very much in the spirit of preprocessing. Programming with the STL is nothing like programming in C; it is a completely different way of expressing solutions.

In summary, the C programming language and its successors showcase how simple yet powerful the technique of preprocessing can be.