Preprocessor
A preprocessor is a program that takes text and performs lexical conversions on it. The conversions may include macro substitution, conditional inclusion, and inclusion of other files.
The C programming language has a preprocessor that performs the following transformations:
- Replaces trigraphs with equivalents.
- Concatenates source lines.
- Replaces comments with whitespace.
- Reacts to lines starting with an octothorp (#), performing macro substitution, file inclusion, conditional inclusion, and other transformations.
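As a small illustration of the line-concatenation and comment-replacement steps listed above, the following sketch may help. The names GREETING, greeting and three are made up for this example and are not part of any library.

```c
/* A backslash at the very end of a line is removed together with the
   newline, so the two physical lines below form one #define directive. */
#define GREETING "Hello, " \
                 "world!"

/* Returns the greeting; the two adjacent string literals are joined by
   the compiler after preprocessing. */
const char *greeting(void)
{
    return GREETING;
}

/* The comment inside the expression is replaced by whitespace before the
   compiler proper sees it, so this is simply 1 + 2. */
int three(void)
{
    return 1 /* replaced by whitespace */ + 2;
}
```

Calling greeting() from a program yields the joined string, and three() returns 3.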
The use of preprocessors has become less common as recent languages provide more abstract features rather than lexically oriented ones. Indeed, overuse of the preprocessor can yield quite chaotic code. In designing a new language based on C, Stroustrup introduced features such as inline and templates into C++ in an attempt to make the C preprocessor less relevant. Nevertheless, there is an abundance of existing C code which relies on the preprocessor, so the ability to understand such code is still important.
New languages proposed recently have little or no preprocessor ability. Java has no preprocessor. D, designed as a replacement for C and C++, supports features such as imports, nested functions, versioning, and debug statements that help make it practical to eliminate the preprocessor entirely.
Other preprocessors include m4 and Oracle Pro*C. The m4 preprocessor is general-purpose; Oracle Pro*C converts SQL statements (including PL/SQL blocks) embedded in a C source file into ordinary C.
Preprocessing can be quite cumbersome in incremental parsing or incremental lexical analysis because changes to preprocessing rules can affect the entire text to be preprocessed.
Most C compilers have a flag (for example, -E in GCC and Clang) which emits the preprocessed source, so that static analysis can be performed on the output if desired.
C Examples
This section goes into some detail about C preprocessor usage. Good programming practice when writing C macros is crucial, particularly in a collaborative setting, so notes on this have been included. Of course, it is possible to abuse these features, but this is not recommended in a production environment. Used responsibly, the C preprocessor can be an excellent optimization, debugging and code documentation tool.
The most common use of the preprocessor is to include another file:
#include <stdio.h>
int main (void)
{
    printf("Hello, world!\n");
    return 0;
}
The preprocessor replaces the line #include <stdio.h> with the contents of the system header file of that name, which declares the printf() function.
More precisely, the entire text of the file 'stdio.h' is inserted into the file at that point.
This can also be written using double quotes, e.g. #include "stdio.h". The angle brackets were originally used to indicate 'system' include files, and double quotes user-written include files, and it is good practice to retain this distinction. C compilers and programming environments all have a facility which allows the programmer to define where include files can be found. This can be introduced through a command line flag, which can be parameterized using a makefile, so that a different set of include files can be swapped in for different operating systems, for instance.
It is good programming practice to use the compiler parameter to define the include file paths. It is not good programming practice to use relative or absolute file names in #includes. If your source is copied to another system with a different directory structure, this could 'break' your code, requiring numerous edits to get it to compile on the new system.
Conventionally, include files are given a .h extension, and the files they are included in are given the .c extension. However, there is no particular requirement that this be observed. Occasionally one will see files with other extensions included in a .c file, including other .c files.
The #define, #if, #ifdef, #ifndef, #else, #elif and #endif directives can be used for conditional compilation.
#define __WINDOWS__
#ifdef __WINDOWS__
#include <windows.h>
#else
#include <mac.h>
#endif
The first line defines a symbol __WINDOWS__. This could also be introduced from a compiler command line parameter, so that the program could be parameterized in a makefile.
Subsequently, if __WINDOWS__ is defined, the file <windows.h> is included; otherwise <mac.h> is included.
Note that when a #define is given only a name, with no replacement text, the symbol is simply marked as defined, which is all that #ifdef and #ifndef test. Symbols defined from a compiler command line (for example with -D in GCC and Clang) are typically assigned the value 1.
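A minimal sketch of conditional compilation that does not depend on platform headers is shown here. The symbol USE_VERBOSE_LOG and the two functions are hypothetical names chosen for the illustration; the symbol could equally be supplied from the compiler command line instead of being defined in the source.

```c
/* Hypothetical feature switch. Only its name is given, so it is merely
   "defined"; no replacement text is needed for #ifdef to succeed. */
#define USE_VERBOSE_LOG

#ifdef USE_VERBOSE_LOG
#define LOG_PREFIX "[verbose] "
#else
#define LOG_PREFIX ""
#endif

/* Returns the prefix selected at preprocessing time. */
const char *log_prefix(void)
{
    return LOG_PREFIX;
}

/* Returns 1 when the verbose branch was compiled in, 0 otherwise. */
int verbose_enabled(void)
{
#if defined(USE_VERBOSE_LOG)
    return 1;
#else
    return 0;
#endif
}
```

Only one branch of each conditional reaches the compiler; removing the #define line switches both functions to the other branch without any further edits.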
A #define can take two arguments, in which case, the second argument is textually substituted for the first. This is conventionally used as part of good programming practice to create symbolic names for constants, e.g.
#define PI 3.14159
instead of hard-coding those numbers throughout one's code.
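For instance, a symbolic constant might be used as in the sketch below. The function names circle_area and circle_circumference are illustrative assumptions, not taken from any library.

```c
#define PI 3.14159

/* Every use of PI below is replaced by the text 3.14159 before the
   compiler proper runs. */
double circle_area(double radius)
{
    return PI * radius * radius;
}

double circle_circumference(double radius)
{
    return 2.0 * PI * radius;
}
```

If a more precise value is needed later, only the #define line has to change.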
A #define can be used to create a macrofunction:
#define RADTODEG(_x) ((double)(_x) * 57.29577951)
This defines a radians-to-degrees conversion which can be written subsequently as, e.g., RADTODEG(34). The preprocessor expands it in place, so the compiler sees the expression 34*57.29... directly and no function call is involved; with a constant argument, an optimizing compiler can evaluate the product at compile time instead of at run time. In C++ (and in C since C99) it is possible to use the inline keyword to ask the compiler to expand a function in this fashion, but this is simply a 'suggestion' to the compiler; the compiler may ignore it, for instance if the function exceeds a certain level of complexity. A C macrofunction, by contrast, is always expanded textually.
Note that a C compiler is not strictly required to perform the math at compile time, except where the language demands a constant expression, but in practice compilers do so when all of the arguments to the macrofunction expand into integer or floating-point constants. The computation is generally left to run time if any of the arguments is a variable.
Some notes on style: the macro name is written in all uppercase to emphasize that it is a macro, not a compiled function. The argument is written with a prefixed underscore. This is because there might be a variable 'x' in the code, and a parameter with the same name could cause confusion when a literal 'x' appears in the macro preprocessor output.
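The parentheses around the argument in a macrofunction matter as much as the naming. The sketch below contrasts a parenthesized macro with a deliberately unsafe one; RADTODEG_UNSAFE, DEG_PER_RAD and the two wrapper functions are hypothetical names introduced only for this comparison, and 57.29577951308232 is 180/pi written out as a literal.

```c
/* Conversion factor 180/pi. */
#define DEG_PER_RAD 57.29577951308232

/* Argument and whole expression are parenthesized. */
#define RADTODEG(_x) ((double)(_x) * DEG_PER_RAD)

/* Deliberately unsafe variant, for contrast only. */
#define RADTODEG_UNSAFE(_x) _x * DEG_PER_RAD

double safe_sum_to_degrees(double a, double b)
{
    /* Expands to ((double)(a + b) * 57.29...): the sum is converted. */
    return RADTODEG(a + b);
}

double unsafe_sum_to_degrees(double a, double b)
{
    /* Expands to a + b * 57.29...: only b is scaled. */
    return RADTODEG_UNSAFE(a + b);
}
```

Because expansion is purely textual, operator precedence is applied after substitution, which is why the unsafe variant silently computes something different.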
One of the most subtle and easily abused features of the C preprocessor is token concatenation (often called token pasting). Inside the body of a macrofunction, two tokens can be 'glued' together using two pound signs (##), producing a single new token, such as an identifier, in the preprocessed code. This should not be confused with the concatenation of adjacent string literals, which the compiler performs after preprocessing. Token concatenation can be used to construct elaborate macros which act much like C++ templates, without many of their benefits.
For instance:
#define MYCASE(_id,_item) \
case _item: \
_id##_##_item=_item;\
break
switch(x) {
MYCASE(widget,23);
}
The line MYCASE(widget,23) gets expanded here into case 23: widget_23=23; break;.
Note that the _ between the two ##s is 'literal' whereas the _id and _item arguments are 'arguments' to the macrofunction.
One stylistic note about this macro is that the semicolon on the last line of the macro definition is omitted so that the macro looks 'natural' when written. It could be included in the macro definition, but then there would be lines in the code without semicolons at the end which would throw off the casual reader.
The macro can be extended over as many lines as required using a backslash escape at the end of each line. The macro ends on the first line which does not end in a backslash.
One drawback of multi-line macros is that // comments cannot be used inside the macro definition: the continued lines are joined before comments are removed, so such a comment would swallow the rest of the macro. Only /* ... */ comments work there, which makes line-by-line source documentation in the body of the macro awkward. However, properly used, multi-line macros can greatly reduce the size and complexity of a C program and enhance its readability and maintainability.
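To see the MYCASE idea in a complete translation unit, the following sketch can be compiled as it stands. The variables widget_23 and widget_42 and the function record_widget are assumptions added for the illustration; the macro body uses the argument _item both as the case label and as the assigned value.

```c
/* Each use expands to: case N: <id>_N = N; break */
#define MYCASE(_id, _item)     \
    case _item:                \
        _id##_##_item = _item; \
        break

/* The pasted identifiers must exist as real variables. */
int widget_23 = 0;
int widget_42 = 0;

/* Stores x in the matching widget_N variable.
   Returns 1 if a case matched, 0 otherwise. */
int record_widget(int x)
{
    switch (x) {
        MYCASE(widget, 23);
        MYCASE(widget, 42);
    default:
        return 0;
    }
    return 1;
}
```

Calling record_widget(23) sets widget_23 to 23 and leaves widget_42 untouched; any other value except 42 falls through to the default branch.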
The #error directive inserts an error message into the compiler output.
#error "Gosh!"
This reports Gosh! in the compiler output and halts compilation at that point. This is extremely useful if you aren't sure whether a given line is being compiled or not. It is also useful if you have a heavily parameterized body of code and want to make sure a particular #define has been introduced from the makefile, e.g.:
#ifdef __WINDOWS__
... // windows specific code
#elif defined(__MAC__)
... // apple specific code
#elif defined(__UNIX__)
... // unix specific code
#else
#error "NO OPERATING SYSTEM DEFINED!!!"
#endif
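The #warning directive is written identically, and also includes its message in the compiler output, but doesn't halt the compiler. It began as a compiler extension (GCC supports it, for example) and was standardized only in C23.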
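The #error check described above can also guard numeric configuration values. In this sketch BUFFER_SIZE is a hypothetical parameter that a makefile might override, and buffer_capacity is an illustrative helper; with the default of 256 the #error line is skipped and the file compiles normally.

```c
#include <stddef.h>

/* Hypothetical build parameter; use a default when none was supplied. */
#ifndef BUFFER_SIZE
#define BUFFER_SIZE 256
#endif

/* Stop the build early if the parameter is unusable. */
#if BUFFER_SIZE < 16
#error "BUFFER_SIZE must be at least 16"
#endif

static char buffer[BUFFER_SIZE];

/* Returns the number of bytes reserved for the buffer. */
size_t buffer_capacity(void)
{
    return sizeof buffer;
}
```

Defining BUFFER_SIZE as, say, 8 on the command line would make compilation stop at the #error line with the given message.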
Then there is the #pragma directive. This is a compiler specific directive which each vendor uses for whatever purpose they wish. For instance, #pragmas are used to allow suppression of specific error messages, manage heap and stack debugging, and so on.
Certain symbols are predefined in ANSI C. Two useful ones are __FILE__ and __LINE__, which expand into the current file and line number. For instance:
// debugging macros so we can pin down message provenance at a glance
#ifndef WHERESTR
#define WHERESTR "[file %s, line %d] "
#endif
#ifndef WHEREARG
#define WHEREARG __FILE__,__LINE__
#endif
printf(WHERESTR ": hey, x=%d\n", WHEREARG,x);
This prints the value of x, preceded by the file and line number, so the line that produced the message can be found quickly. Note that WHERESTR expands to a string literal, which the compiler concatenates with the adjacent string literal that follows it.
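The same two macros can be wrapped in a function that formats into a buffer rather than printing directly. This is only a sketch: format_located is a hypothetical helper, and snprintf is used so that the result can be inspected by the caller.

```c
#include <stdio.h>

#define WHERESTR "[file %s, line %d] "
#define WHEREARG __FILE__, __LINE__

/* Writes "[file ..., line ...] x=<value>" into buf.
   Returns the value reported by snprintf. */
int format_located(char *buf, size_t size, int x)
{
    /* __LINE__ is the line on which WHEREARG is expanded, i.e. this one. */
    return snprintf(buf, size, WHERESTR "x=%d", WHEREARG, x);
}
```

Because __FILE__ and __LINE__ are substituted where the macro is expanded, the reported location is that of the snprintf call inside the helper, not of the helper's caller; placing WHEREARG directly at each call site, as in the printf example, reports the caller's line instead.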